ABSTRACT

One of a series of semiannual reports, this
publication contains 12 articles which report the status and progress
of studies on the nature of speech, instrumentation for its
investigation, and practical applications. The titles of the articles
and their authors are as follows: "Coarticulatory Organization for
Lip-rounding in Turkish and English" (Suzanne E. Boyce); "Long Range
Coarticulatory Effects for Tongue Dorsum Contact in VCVCV Sequences"
(Daniel Recasens); "A Dynamical Approach to Gestural Patterning in
Speech Production" (Elliot L. Saltzman and Kevin G. Munhall);
"Articulatory Gestures as Phonological Units" (Catherine P. Browman
and Louis Goldstein); "The Perception of Phonetic Gestures" (Carol A.
Fowler and Lawrence D. Rosenblum); "Competence and Performance in
Child Language" (Stephen Crain and Janet Dean Fodor); "Cues to the
Perception of Taiwanese Tones" (Hwei-Bing Lin and Bruno H. Repp);
"Physical Interaction and Association by Contiguity in Memory for the
Words and Melodies of Songs" (Robert G. Crowder and others);
"Orthography and Phonology: The Psychological Reality of Orthographic
Depth" (Ram Frost); "Phonology and Reading: Evidence from Profoundly
Deaf Readers" (Vicki L. Hanson); "Syntactic Competence and Reading
Ability in Children" (Shlomo Bentin and others); and "Effect of
Emotional Valence in Infant Expressions upon Perceptual Asymmetries
in Adult Viewers" (Catherine T. Best and Heidi Freya Queen). An
appendix lists DTIC and ERIC numbers for publications in this series
since 1970. (SR)

Coarticulatory Organization for Lip-rounding in
Turkish and English

Suzanne E. Boyce



INTRODUCTION 

Theories of coarticulation in speech have taken
as an axiom the notion that, by coarticulating
segments, a speaker is aiding the efficiency of his
or her production (Liberman & Studdert-Kennedy,
1977). Although discussion of the forces affecting
coarticulation has tended to concentrate on
articulatory and/or perceptual pressures operating
within particular sequences of segments
(Beckman & Shoji, 1984; Martin & Bunnell, 1982;
Ohala, 1981; Recasens, 1985), there is a growing
body of cross-linguistic work exploring the
influence of language-particular phonological
structure on coarticulation (Keating, 1988; Lubker
& Gay, 1982; Magen, 1984; Manuel, 1990; Öhman,
1966; Perkell, 1986, among others). These studies
have generally been concerned with the
interaction of coarticulation and segment
inventory, or coarticulation and the properties of
some particular segment; the question of how
coarticulation interacts with phonological rules
has been relatively neglected (but cf. Cohn, 1988).
Phonological rules, for instance, determine the
typical structure of words in a language; we might
speculate that languages with different
constraints on the possible sequencing of
segments pose different challenges to the
articulatory planner, and thus that speakers of
these languages would vary in the way they
implement coarticulation. To take an example,
speakers of Turkish, a vowel harmony language
with strict rules for the possible sequencing of
rounded and unrounded vowels, might feel more
pressure to employ rounding coarticulation than
English speakers, whose language freely combines
rounded and unrounded vowels.

The research in this paper was supported by NIH grants
NS-13617 and BRS RR-05596 to Haskins Laboratories.
Suggestions, comments and criticism supplied by Katherine
Harris, Louis Goldstein, Michael Studdert-Kennedy, Ignatius
Mattingly, Fredericka Bell-Berti, Sharon Manuel, Rena
Krakow, Marie Huffman, Joe Perkell and John Westbury, and
the JASA review process are gratefully acknowledged. Advice
on Turkish linguistics was provided by Jaklin Kornfilt and
Engin Sezer.

Rounding coarticulation for sequences of
rounded and unrounded vowels in English has
been extensively studied. A number of studies
have shown that for strings of two rounded vowels
separated by non-labial consonants, e.g., /utu/ or
/ustu/, both EMG and lip protrusion movement
traces show double peaks coincident with the two
rounded vowels plus an intervening dip or trough
in the signal (Engstrand, 1981; Gay, 1978;
McAllister, 1978; Perkell, 1986). (A schematized
version of this pattern, representing EMG from
the orbicularis oris muscle for the utterance /utu/,
is illustrated in Figure 1.) This result has been the
focus of a good deal of controversy in recent years,
primarily because different theories of
coarticulation tend to treat it in different ways.
For instance, much previous work on the control
mechanisms underlying anticipatory coarticulation
has centered on the predictions of one
class of models, the "look-ahead" or "feature-spreading"
models (Benguerel & Cowan, 1974;
Daniloff & Moll, 1968; Henke, 1966). Generally,
these models view coarticulation as the migration
of features from surrounding phones. In the most
explicit form of this type of model, Henke's (1966)
computer implementation of articulatory
synthesis, the articulatory planning component
scans upcoming segments and implements
features as soon as preceding articulatorily
compatible segments make it possible to do so.1 In
the case of /utu/ or /ustu/, the non-labial
consonants separating the vowels are made with
the tongue and presumably do not conflict with
simultaneous lip-rounding. Thus, the fact that
troughs occur is a problem for the look-ahead
model, because the model would normally predict
that the rounding feature for the second vowel
would spread onto the preceding consonant,
producing a continuous plateau of rounding from
vowel to vowel.
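The plateau prediction can be made concrete with a toy sketch. This is not Henke's (1966) program: the segment table, the three-valued rounding specification (rounded, unrounded, unspecified) and the function name are assumptions introduced here purely for illustration.

```python
# Toy illustration of a look-ahead (feature-spreading) rule for lip rounding.
# Assumption: each segment is rounded (True), unrounded (False), or
# unspecified for rounding (None). This is NOT Henke's (1966) implementation.
ROUNDING = {"u": True, "i": False, "t": None, "s": None, "k": None, "l": None}


def look_ahead(segments):
    """Spread each specified rounding value leftward onto unspecified segments."""
    plan = [ROUNDING[seg] for seg in segments]
    upcoming = None
    # Scan right to left so each unspecified segment inherits the value of
    # the nearest following segment that is specified for rounding.
    for idx in range(len(plan) - 1, -1, -1):
        if plan[idx] is None:
            plan[idx] = upcoming
        else:
            upcoming = plan[idx]
    return plan


print(look_ahead(list("utu")))   # [True, True, True]
print(look_ahead(list("ustu")))  # [True, True, True, True]
```

Under these assumptions the consonants in /utu/ and /ustu/ are planned as rounded throughout, which is the continuous plateau that the observed troughs contradict.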




Figure 1. Schematized version of "trough" pattern,
representing EMG from the orbicularis oris muscle for the
utterance /utu/.

Explanations of the trough results have varied
widely. Following a more general suggestion of
Kozhevnikov and Chistovich (1965), Gay (1978)
proposed that the trough represents the resetting
of a "syllable-sized articulatory unit" and that
coarticulation is allowed to take place only within
that unit. Some evidence against this explanation
was provided by Harris and Bell-Berti (1984), who
found no sign of a trough in sequences such as
/uhu/ and /uʔu/. Addressing himself to the
sequences typically used in these experiments,
Engstrand (1981) took issue with the assumption
that alveolar consonants are compatible with full
lip-rounding. He argued instead that rounding as
found in the vowel /u/ may interfere with optimal
acoustic/aerodynamic conditions for these
consonants, and that the trough may result from
lip movement towards a less-rounded
configuration. That such acoustic/aerodynamic
constraints may not hold for all subjects was
shown by Gelfer, Bell-Berti and Harris (1989),
who reported data from a subject with lip
protrusion and EMG Orbicularis Oris Inferior
(OOI) activity for /t/. Perkell (1986) hypothesized
that a diphthongal pattern of movement for the /u/
vowels (i.e., from a less to a more extreme lip
position), might, in addition to acoustic and other
constraints on the consonants, reduce the extent
of rounding in the vicinity of the intervocalic
consonant(s). In his own work, however, he found
little evidence for diphthongal behavior in those
subjects who showed troughs.

Each of these proposals, it should be noted, can
be seen as a modification to the look-ahead class of
models, in which features of one segment spread
to another segment if context conditions allow it.
Alternatively, a class of models known as
"coproduction," "frame" or "time-locking" models
(Bell-Berti & Harris, 1981; Fowler, 1980) assumes
that coarticulation results from temporal overlap
between independent articulatory gestures
belonging to neighboring segments. In these
models, lip movement for the utterance /utu/
involves two overlapping rounding gestures
(assuming /t/ is not independently rounded). Thus,
the presence of a trough is controlled by the
degree of overlap between gesture peaks. If the
peaks overlap one another, no trough will be
discernible, but if the peaks are temporally
separated from one another, the model predicts
the occurrence of a trough. Another provision of
these models is that gestures associated with a
particular segment should show a stable profile
across different segmental contexts. (It is
acknowledged that characteristic gesture profiles
may be affected by stress and possibly other
prosodic contexts (Tuller, Kelso, & Harris, 1982).)
Thus, the temporal extent of coarticulation is
predicted by the temporal extent of the gesture.
Attempts to test the latter provision of this model
by measuring the lag times between the acoustic
onset of rounded vowels and related articulatory
activity have had varied and sometimes
conflicting results, with studies by Bell-Berti and
Harris (1979, 1982) and Engstrand (1981)
supporting the coproduction prediction of stable
lag times, and studies by Lubker (1981), Lubker
and Gay (1982), Sussman and Westbury (1981)
and Perkell (1986) indicating more variable
behavior supportive of the look-ahead view. Thus,
although it is unclear how the look-ahead class of
models can account for the trough pattern, both
types of models remain viable options for
explaining coarticulation.
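The dependence of the trough on gesture overlap can be illustrated numerically. The sketch below is only a schematic reading of the coproduction account: the Gaussian gesture shape, the 60 ms width, the simple summation of the two gestures and all function names are assumptions made here, not quantities taken from the models cited above.

```python
import math


def gesture(t_ms, center_ms, width_ms):
    """Bell-shaped activation of one rounding gesture (Gaussian shape is an assumption)."""
    return math.exp(-0.5 * ((t_ms - center_ms) / width_ms) ** 2)


def summed_rounding(separation_ms, width_ms=60.0):
    """Sum two identical gestures whose peaks lie separation_ms apart, sampled every 1 ms."""
    return [
        gesture(t, 0.0, width_ms) + gesture(t, separation_ms, width_ms)
        for t in range(-400, 801)
    ]


def count_peaks(trace):
    """Count strict local maxima in a sampled trace."""
    return sum(
        1
        for i in range(1, len(trace) - 1)
        if trace[i - 1] < trace[i] > trace[i + 1]
    )


def has_trough(separation_ms, width_ms=60.0):
    """A trough is present when the summed trace has two separate peaks."""
    return count_peaks(summed_rounding(separation_ms, width_ms)) >= 2


print(has_trough(100.0))  # heavily overlapping gestures: False
print(has_trough(300.0))  # well separated gestures: True
```

With these illustrative settings the same pair of gestures yields either a single plateau-like peak or a double-peaked trough pattern, depending only on how far apart in time the gestures are.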

Regardless of which interpretation of the trough
pattern is correct, however, the pattern itself has
been found in each of the languages so far
surveyed, appearing in English (Bell-Berti &
Harris, 1974; Gay, 1978; Perkell, 1986), Swedish
(Engstrand, 1981; McAllister, 1978), Spanish and
French (Perkell, 1986). It is notable that these
languages, while differing in such variables as
syllable structure, the tendency to diphthongize
vowels, and the presence of a phonological
contrast between rounded and unrounded vowels,
are alike in their tolerance for mixed sequences of
rounded and unrounded vowels. It seemed
plausible, at least, that the finding of troughs in
lip-rounding activity for these languages might be
related to this tolerance, and that a language like
Turkish, in which words with mixed rounded and
unrounded vowels are the exception, might show
lip-rounding patterns other than the trough
pattern. In particular, it seemed that Turkish
might provide particularly favorable conditions for
anticipatory coarticulation of the kind predicted
by the look-ahead class of models. In brief, the
hypothesis was that Turkish speakers would
exhibit plateau patterns of activity for lip-rounding.
The experiment described below was
part of a larger study designed to test this
hypothesis (Boyce, 1988). A second aim of the
experiment was to test the degree to which the
coproduction model's explicit prediction of stable,
independent gestures could be used to predict both
English and Turkish movement patterns.

EXPERIMENT 

Four speakers of American English and four
speakers of Standard Turkish produced similarly
structured nonsense words designed to show the
presence or absence of troughs in lip-rounding.
Corpus words for this purpose consisted of the
series /kuktluk/, /kuktuk/, /kukuk/, /kutuk/,
/kuluk/. Because arguments concerning the trough
pattern often hinge on questions concerning the
production of the intervocalic consonants in an
unrounded environment (Benguerel & Cowan,
1974; Gelfer, Bell-Berti, & Harris, 1989), the words
/kiktlik/, /kiktik/, /kikik/, /kitik/, /kilik/ were
included as controls. Additionally, words with
rounded vowels followed by unrounded vowels
/kuktlik/, /kuktik/, /kukik/, /kutik/, and /kulik/, and
words with unrounded vowels followed by rounded
vowels /kiktluk/, /kiktuk/, /kikuk/, /kituk/, and
/kiluk/ were included to provide data on single
protrusion movements. In the remainder of the
paper, words with vowel sequences u-u, i-i, etc. will
be referred to as u-u, i-i, u-i, and i-u words. The
words with intervocalic ktl, which had the longest
vowel-to-vowel intervals, were included to provide
the clearest test case for the presence of a trough
pattern. Words with shorter intervocalic
consonant intervals were included to provide
control information on the lip activity patterns for
different consonants. The carrier phrase for
Turkish speakers was "Bir daha ___ deyiniz"
(pronounced as phonetically spelled and meaning
"Say ___ once again"). The English carrier
phrase was "It's a ___ again."
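The corpus has a fully crossed structure: four vowel pairs by five intervocalic consonant sequences, each embedded in a /k_ _k/ frame. A minimal sketch that regenerates the 20 words follows; the variable and function names are illustrative and not part of the original study.

```python
# Regenerate the 20 nonsense words from the design described above.
VOWEL_PAIRS = [("u", "u"), ("i", "i"), ("u", "i"), ("i", "u")]
MEDIAL_CONSONANTS = ["ktl", "kt", "k", "t", "l"]


def build_corpus():
    """Map each vowel-pair label (e.g. 'u-u') to its five corpus words."""
    return {
        f"{v1}-{v2}": [f"k{v1}{cons}{v2}k" for cons in MEDIAL_CONSONANTS]
        for v1, v2 in VOWEL_PAIRS
    }


corpus = build_corpus()
for label, words in corpus.items():
    print(label, words)
```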

English-speaking subjects included one male 
(AE) and three females (MB, AF and NM), each of 
whom spoke a variety of General American with 
no marked regional or dialectal accent. The 
Turkish speakers included one female (IB) and 
three males (AT, EG and CK). All spoke similar 
varieties of Standard Turkish. 

Additional facts about Turkish which impinge
on the arguments made in this paper have been
summarized in the Appendix. For the present, it is
sufficient to note that rounding in Turkish
operates according to a vowel harmony rule which,
in essence, causes sequences of high vowels to
acquire the rounding specification of the preceding
leftmost vowel. (With minor exceptions,
consonants do not participate in this process.) The
effect is to produce long strings of rounded or
unrounded vowels whose rounding is predictable
given the first vowel in the sequence. While vowel
harmony is a productive rule for the vast bulk of
the lexicon, there are numerous exceptions, mainly
from Arabic and Persian borrowings. Real word
counterparts exist for each of the vowel sequences
in the experimental corpus, although u-u and i-i
words conform to vowel harmony while i-u and u-i
words do not.
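As applied to the corpus words, the harmony rule reduces to a simple check on the high vowels of a word. The sketch below is a deliberate simplification: it covers only high vowels, ignores the exceptions mentioned above and the other aspects of Turkish vowel harmony, and uses a function name invented for this example.

```python
# Simplified rounding-harmony check for words containing only high vowels.
# Assumption: only the four Turkish high vowels are considered; consonants
# are ignored, and lexical exceptions are not modeled.
HIGH_VOWEL_ROUNDING = {"u": True, "ü": True, "i": False, "ı": False}


def conforms_to_rounding_harmony(word):
    """True if every high vowel shares the rounding of the leftmost one."""
    rounding = [HIGH_VOWEL_ROUNDING[ch] for ch in word if ch in HIGH_VOWEL_ROUNDING]
    return all(value == rounding[0] for value in rounding[1:])


for word in ["kuktluk", "kiktlik", "kuktlik", "kiktluk"]:
    print(word, conforms_to_rounding_harmony(word))
```

Under this check the u-u and i-i words are harmonic and the u-i and i-u words are not.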

For Turkish subject EG, utterances were
randomized and the randomized list repeated 15
times. Utterances in later subject runs were
blocked, so that utterances were repeated in 
groups of five tokens (three for MB), utterances 
with the same vowel combinations were grouped 
together, and the same order of consonant 
combinations was repeated for each vowel 
combination. The order of vowel and consonant 
combinations was different for each subject, 
except that one Turkish speaker (IB) and one 
English speaker (AF) had the same order of 
presentation. 

Although Turkish has final stress, the degree of 
difference between stressed and unstressed 
syllables is much less than in English (Boyce, 
1978). Therefore, English speakers were 
encouraged to use equal stress on both syllables of 
the disyllabic nonsense words, and if equal stress 
felt unnatural, to place stress on the final rather 
than the initial syllable. Turkish subjects were
given no instructions about stress. All subjects 
were instructed to speak at a comfortable rate, in 
a conversational manner. 

Instrumentation 

Movement data from the nose, upper and lower
lip, and jaw were obtained by means of an
opto-electrical tracking system, similar to the
commonly used Selcom Selspot system. The
system consists of infrared light emitting diodes
(LED's) attached to the structure of interest. LED
position is sensed by a photo-diode within a
camera positioned to capture the range of LED
movements in its focal plane. The output of this
diode is translated by associated electronics into
pairs of X and Y coordinate potentials for each
LED, each with a maximum frequency response of
500 Hz. Calibration is achieved by moving a diode
through a known distance in the focal plane.

LED's were attached to the subject's nose, upper
lip, lower lip and jaw with double-sided tape. The
nose LED was placed on the bridge of the nose,
slightly to the left side, at a point determined to
show the least speech-related wrinkling, waggling,
etc. LED's were placed just below the vermilion
border of the upper lip and just above the
vermilion border of the lower lip, in a plane with
the nose LED, at a point judged to show the axis
of anterior-posterior movement for each
articulator. The movement of the subject's skin
between the lower lip and chin was observed
during production of rounded vowels, and the jaw
LED positioned to best reflect anterior-posterior
movements of the mandible rather than skin and
muscle. Generally this was at the point of the chin
or under it, in a plane with the higher LED's.

The LED-tracking camera was positioned at 90
degrees to the left of the subject's sagittal midline,
at a camera-to-subject distance (21 inches) that
provided a 10-by-10 inch field of view. When
centered approximately on the upper/lower lip
junction, during maintenance of a position
appropriate for bilabial closure, this field is large
enough to capture the full range of anterior-posterior
LED movement, as well as allowing for
some degree of head movement.

A video camera was positioned 90 degrees to
subject midline on the subject's right, and focused
as narrowly as possible, while continuing to keep
all 4 LED's within the field of view. Five subjects
were videotaped throughout the experiment:
English subjects AF and NM, and Turkish
subjects AT and IB. An additional videotape of
English subject MB producing the words /kiktlik/,
/kuktluk/, /kiktluk/ and /kuktlik/ was obtained in a
separate session.

A simultaneous audio recording of the subject's
speech during the experiment was made with a
Sennheiser "shotgun" microphone.

The EMG recordings were made with adhesive
surface silver-silver chloride electrodes. These
were placed just below and above the vermilion
border of upper and lower lips, laterally to the
midline. According to Blair and Smith (1986), an
electrode at this location is likely to pick up
relatively more activity from orbicularis oris, and
less of nearby muscles, than at other locations
along the lip edge. Pick-up from the desired
muscles, Orbicularis Oris Inferior (OOI) and
Orbicularis Oris Superior (OOS), was checked by
having the subject produce repeated /u/ or /i/
vowels several times in succession; if a strong
signal was evidenced for /u/ and little or no signal
for /i/, the EMG electrode was assumed to be
well-placed.

The EMG and movement signals, together with 
audio and clock signals, were recorded onto a 14- 
channel FM tape recorder (EMI series 7000). The 
EMG signals were rectified, integrated over a 
5 ms window, and sampled at 200 Hz. Movement
signals were also sampled at 200 Hz. The audio 
channel was filtered at 5^00 Hz and sampled at 
10000 Hz. By means of the simultaneous clock 
signals, data from all channels were synchronized
to within 2.5 ms. 
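The EMG conditioning steps (rectification, integration over a 5 ms window, sampling at 200 Hz) can be sketched in software. The original processing chain is not described in further detail, so the version below is a guess at one equivalent digital procedure: the raw sampling rate, the use of non-overlapping windows and the function name are all assumptions. Non-overlapping 5 ms windows give one output value every 5 ms, i.e., a 200 Hz output rate.

```python
def rectify_and_integrate(raw, raw_rate_hz, window_ms=5.0):
    """Full-wave rectify a raw EMG signal and average it over consecutive
    non-overlapping windows (one output sample per window).

    Assumption: `raw` was digitized at `raw_rate_hz`, a hypothetical rate
    higher than the 200 Hz output rate.
    """
    samples_per_window = int(round(raw_rate_hz * window_ms / 1000.0))
    if samples_per_window < 1:
        raise ValueError("window is shorter than one raw sample")
    rectified = [abs(x) for x in raw]
    output = []
    last_start = len(rectified) - samples_per_window
    for start in range(0, last_start + 1, samples_per_window):
        window = rectified[start:start + samples_per_window]
        output.append(sum(window) / samples_per_window)
    return output


# Ten raw samples at a hypothetical 1000 Hz give two 5 ms windows.
print(rectify_and_integrate([1, -1, 1, -1, 1, -2, 2, -2, 2, -2], 1000))
```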

The signal from the nose LED was numerically
subtracted from respective lip and jaw signals to
control for changes in baseline due to head
movement. Differences in baseline between early
and late portions of the experiment remained for
some speakers, presumably due to vertical
rotational movement of the head, in which the lips
and nose, or jaw and nose, moved by different
amounts in space. The speakers most affected
were English speaker AE, for whom the total
horizontal lower lip baseline change was
approximately 3 mm, and Turkish speakers IB,
EG and CK, for whom the total changes were
approximately 3.5, 6.5, and 6 mm respectively. In
each case, baseline change reflected movement in
the posterior direction. Baseline change for other
speakers was within 1 mm of movement.
Rotational movement of this type, in which the
chin sank gradually toward the base of the neck,
was confirmed in the videotape for subject IB
(other videotaped subjects showed little baseline
change). These baseline changes did not appear to
affect the data in any significant way.2
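The head-movement correction is a sample-by-sample subtraction. The sketch below uses invented numbers to show why it removes pure translation of the head: a drift that displaces lip and nose equally cancels, whereas a rotation that moves them by different amounts would leave a residual baseline change, as described above. The function name and data are illustrative only.

```python
def subtract_nose(articulator_x, nose_x):
    """Subtract the nose LED signal from a lip or jaw signal, sample by sample."""
    if len(articulator_x) != len(nose_x):
        raise ValueError("signals must have the same number of samples")
    return [a - n for a, n in zip(articulator_x, nose_x)]


# Invented example: the whole head drifts forward by 0.5 mm per sample.
true_lip = [10.0, 12.0, 14.0, 12.0]
head_drift = [0.0, 0.5, 1.0, 1.5]
measured_lip = [lip + d for lip, d in zip(true_lip, head_drift)]
measured_nose = [3.0 + d for d in head_drift]

corrected = subtract_nose(measured_lip, measured_nose)
print(corrected)  # lip position relative to the nose; the drift has cancelled
```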

Because of recording or calibration problems,
the upper lip movement signal and both EMG
signals for Turkish subject AT, the EMG OOS
signal for Turkish subject EG, and the EMG OOI
signal for Turkish subject CK were eliminated
from the study. Except for AT, therefore, the full
complement of movement signals, and at least one
EMG signal, was available for each subject.
Recording or calibration problems also caused
some of the 15 repetitions (tokens) planned for
words in the experimental corpus to be discarded.
The upper lip signal level for English subject AE
deteriorated after the first block of utterances.
Thus, only the first five tokens for this signal are
reported.

Two acoustic reference points, or lineups, were
identified for each token. The first, the V1 offset,
was defined as the point where the formant
structure disappeared from the waveform at the
onset of closure for /k/ or /t/ or the point of sudden
amplitude change marking the change between
the vowel and the voiced approximant /l/. The
second, the V2 onset, was defined as the release of
the consonant occlusion for /k/ and /t/, or the point
of amplitude change for /l/. Consonant interval
duration measurements consisted of the time
between these two points. The audio waveform,
movement and EMG signals for each repetition
of an utterance in the experimental corpus
were extracted into a separate computer file. Each
file contained a 2000 ms slice of speech with
constant dimensions before and after the V1 offset
point.
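These two measurements translate directly into index arithmetic on the sampled signals. In the sketch below, the split of the 2000 ms slice into time before and after the V1 offset is left as parameters, since the text says only that the dimensions were constant; the values in the usage example, like the function names, are hypothetical.

```python
def consonant_interval_ms(v1_offset_ms, v2_onset_ms):
    """Duration of the intervocalic consonant interval."""
    return v2_onset_ms - v1_offset_ms


def extract_slice(signal, rate_hz, v1_offset_ms, pre_ms, post_ms):
    """Cut a fixed-size slice around the V1 offset.

    pre_ms and post_ms are the constant amounts of signal kept before and
    after the lineup point (hypothetical values; they must sum to 2000 ms
    to match the slice length reported in the text).
    """
    center = int(round(v1_offset_ms * rate_hz / 1000.0))
    start = center - int(round(pre_ms * rate_hz / 1000.0))
    stop = center + int(round(post_ms * rate_hz / 1000.0))
    if start < 0 or stop > len(signal):
        raise ValueError("slice extends beyond the recorded signal")
    return signal[start:stop]


# A 5 s movement signal at 200 Hz, with V1 offset at 2500 ms.
movement = list(range(1000))
token = extract_slice(movement, 200, 2500.0, pre_ms=1000.0, post_ms=1000.0)
print(len(token))                            # 400 samples = 2000 ms at 200 Hz
print(consonant_interval_ms(2500.0, 2720.0))  # 220.0
```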

The main body of movement data reported here
comes from the anterior-posterior upper and lower
lip signals. These signals are referred to in the
text as Upper Lip X (ULX) and Lower Lip X
(LLX). Both signals reflect lip protrusion, which is
generally acknowledged to be the most reliable
single index of lip rounding. However, because
rounding may also involve vertical motion of the
lips, to narrow the lip aperture, and because
vertical movement and protrusion of the lower lip
may be affected by movements of the jaw,
anterior-posterior jaw (JX) and inferior-superior
jaw (JY) and lip signals (ULY, LLY) were
examined as well.

As a rule, token-to-token variability was
minimal in both movement and EMG signals.
Accordingly, much of the presentation in this
paper is based on movement and EMG traces
produced by ensemble averaging. (Those cases
where token-to-token variability was greater than
implied by the averaged signal are mentioned
in the text.) Signals were ensemble
averaged using the acoustic V1 offset as a lineup
point.
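Ensemble averaging aligns the tokens at the lineup point and then averages across tokens at each time step. A minimal sketch, with toy data and a hypothetical function name, is:

```python
def ensemble_average(tokens, lineup_indices, pre, post):
    """Average tokens sample by sample after aligning each at its lineup index.

    tokens: list of sampled traces, one per repetition.
    lineup_indices: index of the V1 offset within each trace.
    pre, post: number of samples kept before and after the lineup point.
    """
    aligned = []
    for trace, lineup in zip(tokens, lineup_indices):
        if lineup - pre < 0 or lineup + post > len(trace):
            raise ValueError("token too short for the requested window")
        aligned.append(trace[lineup - pre:lineup + post])
    count = len(aligned)
    return [sum(column) / count for column in zip(*aligned)]


# Two toy tokens whose peaks fall at different absolute positions
# but at the same position relative to the lineup point.
tokens = [[0, 0, 1, 3, 1, 0, 0], [0, 1, 5, 1, 0, 0, 0]]
print(ensemble_average(tokens, lineup_indices=[3, 2], pre=1, post=2))
# [1.0, 4.0, 1.0]
```

Because each token is shifted to a common lineup point before averaging, events far from that point in time are smeared if their timing varies across tokens, which is why cases of greater token-to-token variability need to be flagged separately.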

Results

Turkish Speakers

Movement and EMG signals were examined
separately for Turkish and English subjects, with
a view to determining characteristic movement
and muscle activity patterns for u-u words. Figures
2 through 5 show the averaged ULX, LLX and
EMG traces for /kuktluk/ and /kiktlik/ as produced
by the four Turkish speakers AT, IB, EG and CK.

Overall, the /kuktluk/ movement traces for these
subjects tended to resemble a plateau, with the
protrusion traces being flat or slightly falling over
the course of the word. Exceptions to this pattern
are the occurrence of a peak, or bump, during the
consonant interval in the LLX traces for subjects
EG and CK, and the slight trough located at the
beginning of V2 in the ULX signal for subject EG.
Out of the three Turkish speakers with EMG data
(OOS and OOI for IB, OOI for EG and OOS for
CK), there was no conspicuous diminution of EMG
activity during the consonant interval. The
general pattern was unimodal. For IB and CK,
there was an early peak on V1 followed by a long,
sustained offset and some indication of increased
activity during V2. For subject EG, the EMG peak
was located close to V2 in /kuktluk/ and to V1 in
all other u-u words. (Movement patterns, however,
were similarly plateau-like over EG's different u-u
words.)3

Figure 2. Averaged lower lip movement traces (dashed line) plus single token acoustic waveform traces, for /kuktluk/
(15 tokens) and /kiktlik/ (15 tokens) as produced by Turkish speaker AT. Upwards deflection represents anterior
movement. The vertical line indicates the lineup point for ensemble averaging, which was at the acoustic offset of V1.
The vertical scale for the lower lip (LL) trace is 0-20 mm. Square brackets in the lower panel indicate approximate
acoustic boundaries for these words. The horizontal scale is time in ms.



Figure 3. Averaged movement (dashed line) and EMG (solid line) traces, plus single token acoustic waveform traces,
for /kuktluk/ (15 tokens) and /kiktlik/ (15 tokens) as produced by Turkish speaker IB. Upwards deflection represents
anterior movement. The vertical line indicates the lineup point for ensemble averaging, which was at the acoustic
offset of V1. The vertical scale for both upper lip (UL) and lower lip (LL) traces is 0-20 mm. The vertical scale for both
EMG OOS and EMG OOI traces is 200 µV. Square brackets in the lower panel indicate approximate acoustic
boundaries for these words. The horizontal scale is time in ms.

Figure 4. Averaged movement (dashed line) and EMG (solid line) traces, plus single token acoustic waveform traces,
for /kuktluk/ (14 tokens) and /kiktlik/ (15 tokens) as produced by Turkish speaker EG. Upwards deflection represents
anterior movement. The vertical line indicates the lineup point for ensemble averaging, which was at the acoustic
offset of V1. The vertical scale for both upper lip (UL) and lower lip (LL) traces is 0-20 mm. The vertical scale for EMG
OOI traces is 300 µV. Square brackets in the lower panel indicate approximate acoustic boundaries for these words.
The horizontal scale is time in ms.

Figure 5. Averaged movement (dashed line) and EMG (solid line) traces, plus single token acoustic waveform traces,
for /kuktluk/ (8 tokens) and /kiktlik/ (15 tokens) as produced by Turkish speaker CK. Upwards deflection represents
anterior movement. The vertical line indicates the lineup point for ensemble averaging, which was at the acoustic
offset of V1. The vertical scale for both upper lip (UL) and lower lip (LL) traces is 0-10 mm. The vertical scale for EMG
OOS traces is 150 µV. Square brackets in the lower panel indicate approximate acoustic boundaries for these words.
The horizontal scale is time in ms.



As noted in the introduction, it is well known
that some English speakers may protrude and/or
narrow their lips for non-labial consonants, most
notably /t/ (Gelfer, Bell-Berti, & Harris, 1989) and
/l/ (Brown, 1981; Leidner, 1973). Looking at the
/kiktlik/ traces, it appears that Turkish subjects
AT, IB and EG do not produce significant
independent protrusion during the
intervocalic consonants. Although small
fluctuations in LLX signals may indicate some
degree of active lower lip protrusion, these may
also be due to lip relaxation from a retracted
position during the flanking /i/ vowels. The
strongest degree of movement during the /kiktlik/
consonant interval is seen for subject CK. It is
hard to tell if this reflects active movement rather
than passive relaxation, however, as CK retracts
lip and jaw heavily for the sustained /a/ of "daha"
(pronounced [daa]) in the carrier phrase (Boyce,
1988). EMG traces for all subjects during i-i words
were flat.

English subjects in this study could be divided
into two groups based on the appearance of their
horizontal lower lip signals for u-u and i-i words.
(Upper lip signals were less clearly differentiated.)
Examples of these patterns can be seen in Figures
6 and 7, which show the averaged ULX, LLX and
EMG traces for /kuktluk/ and /kiktlik/ as produced
by English subjects AE and AF. For both
speakers, /kuktluk/ movement traces showed
double-peaked trough patterns. A similar
double-peaked pattern can be seen for subject AF's EMG
OOI and OOS. For subject AE, the EMG OOS
trace is clearly double-peaked. The EMG OOI
trace also shows two peaks, but with an additional
peak between.4

The right-hand panels of Figures 6 and 7 show
averaged ULX, LLX and EMG traces for the
word /kiktlik/. As can be seen, for subject AF there
is little or no movement for either lip during the
intervocalic consonant interval, and little or no
activity in the EMG signal. Some fluctuation in
the movement signal for LLX is present for AE.
Comparison with the presumably neutral position
of the lips during the schwa vowel from "again" at
the end of the carrier phrase (between 400 and
600 ms after the V1 offset point) suggests that this
may be due to lip retraction during the flanking
vowels, with relaxation of the lips during the
consonant interval. Alternatively, some small
active forward movement of the lower lip and/or
jaw may be involved.

Figure 6. Averaged movement (dashed line) and EMG (solid line) traces, plus single token acoustic waveform traces,
for /kuktluk/ and /kiktlik/ as produced by English speaker AE. The vertical line indicates the lineup point for
ensemble averaging, which was at the acoustic offset of V1. Upwards deflection represents anterior movement. The
vertical scale for both upper lip (UL) and lower lip (LL) traces is 0-20 mm. The vertical scale for both EMG OOS and
EMG OOI traces is 200 µV. Square brackets in the lower panel indicate approximate acoustic boundaries for these
words. The upper lip traces for /kuktluk/ and /kiktlik/ are averaged from 5 tokens. The lower lip trace for /kuktluk/ is
averaged from 15 tokens, that for /kiktlik/ from 13 tokens. The horizontal scale is time in ms.

Figure 7. Averaged movement (dashed line) and EMG (solid line) traces, plus single token acoustic waveform traces,
for /kuktluk/ (14 tokens) and /kiktlik/ (15 tokens) as produced by English speaker AF. Upwards deflection represents
anterior movement. The vertical line indicates the lineup point for ensemble averaging, which was at the acoustic
offset of V1. The vertical scale for both upper lip (UL) and lower lip (LL) traces is 0-20 mm. The vertical scale for both
EMG OOS and EMG OOI traces is 500 µV. Square brackets in the lower panel indicate approximate acoustic
boundaries for these words. The horizontal scale is time in ms.






The left and right panels of Figures 8 and 9
show averaged ULX, LLX and EMG traces for
English subjects MB and NM. Looking at the
/kuktluk/ words, we see that for both MB and NM,
the lower lip trace shows three peaks of
movement. The first peak is located during the
"It's" of the carrier phrase (at approximately 350
ms before the V1 offset point) and probably
indicates protrusion associated with /s/. The
central, and largest, peak is located during the
intervocalic consonant interval (between the two
vertical lines). The third peak is located during
the second vowel. At the same time, the EMG
OOI traces for MB and NM show
trough patterns like those of English subjects AE
and AF (NM's EMG OOS trace, like AE's OOI
trace, shows an additional peak after V1 offset).
The upper lip pattern for subject NM also
resembles those for subjects AF and AE. Subject
MB's upper lip movement pattern contains two
peaks, which correspond roughly in time to the
central and final peaks of the lower lip trace.

There is less apparent consistency between
upper lip, lower lip, and EMG traces for these
subjects than for subjects AE and AF. Looking at
their /kiktlik/ traces, however, we see that both
subjects MB and NM show protrusion in the lower
lip signal during the consonant interval. There is
also some protrusion in MB's upper lip /kiktlik/
trace. The latter is likely to reflect active forward
movement since there is no sign of upper lip
retraction on the flanking vowels.

Figure 8. Averaged movement (dashed line) and EMG (solid line) traces, plus single token acoustic waveform traces,
for /kuktluk/ (15 tokens) and /kiktlik/ (15 tokens) as produced by English speaker MB. Upwards deflection represents
anterior movement. The vertical line indicates the lineup point for ensemble averaging, which was at the acoustic
offset of V1. The vertical scale for both upper lip (UL) and lower lip (LL) traces is 0-15 mm. The vertical scale for both
EMG OOS and EMG OOI traces is 800 µV. Square brackets in the lower panel indicate approximate acoustic
boundaries for these words. The horizontal scale is time in ms.

Figure 9. Averaged movement (dashed line) and EMG (solid line) traces, plus single token acoustic waveform traces,
for /kuktluk/ (15 tokens) and /kiktlik/ (14 tokens) as produced by English speaker NM. Upwards deflection represents
anterior movement. The vertical line indicates the lineup point for ensemble averaging, which was at the acoustic
offset of V1. The vertical scale for both upper lip (UL) and lower lip (LL) traces is 0-10 mm. The vertical scale for both
EMG OOS and EMG OOI traces is 500 µV. Square brackets in the lower panel indicate approximate acoustic
boundaries for these words. The horizontal scale is time in ms.



In Figures 10 and 11, we see the lower lip signal
for /kiktlik/ overlaid with that for /kuktluk/ for
English subjects MB and NM. (Baseline differences
between averaged traces have been adjusted when
necessary, so as to visually align carrier phrase
portions of each trace.) From this, it is clear that
the timing of the consonant interval protrusion
peak in /kiktlik/ is very similar to the timing of
the central peak in these subjects' /kuktluk/
traces. This is most striking for MB, whose
protrusion peak in /kiktlik/ was also similar in
amplitude to that of /kuktluk/. For NM, the
central peak in /kuktluk/ was slightly bimodal,
and the protrusion peak in /kiktlik/ is more nearly
matched in timing to the second inflection. For
both subjects, a similar congruence of peaks can
be seen when lower lip /kiktluk/ traces (shown in
Figures 20 and 21) are overlaid with /kiktlik/ traces.

Figure 10. Superimposed /kuktluk/ and /kiktlik/ averaged
lower lip protrusion traces as produced by English
subject MB. The vertical line is V1 offset. Vertical and
horizontal scales are as in Figure 8.

Figure 11. Superimposed /kuktluk/ and /kiktlik/ averaged
lower lip protrusion traces as produced by English
subject NM. The vertical line is V1 offset. Vertical and
horizontal scales are as in Figure 9.

These observations suggest that the central
peak of the lower lip trace for /kuktluk/ may be
largely due to protrusion for one or more of the
intervocalic consonants. The protrusion in MB's
upper lip trace may also reflect movement for
consonants.⁵ The implication of these observations
is that protrusion movement during the
consonants in /kuktluk/ is independent of its
rounded vowel context. It is interesting, in this
context, that both EMG OOS and OOI signals for
/kuktluk/ are less strong during the consonant
interval than during the rounded vowels. It seems
likely that some or all of the consonant interval
protrusion for these signals is due to jaw activity,
perhaps as a consequence of jaw raising for the
consonantal occlusions.⁶

Summary 

The basic question behind the experiment was
whether English and Turkish speakers would
show the same articulatory patterns when
producing similar words with rounded vowels
separated by rounding-neutral consonants. The
data presented here indicate that they do not.
Rather than showing the consistent trough-like
movement and EMG patterns exhibited by
English speakers AE and AF, and reported in the
literature for speakers of English, Swedish,
Spanish and French, Turkish speakers show a
consistent plateau-like pattern of movement and a
unimodal pattern of EMG activity (with the
possible exception of the ULX signal for subject
EG). Equally, the Turkish subjects' patterns of
movement and EMG contrast with the multi-peaked
movement pattern and trough-like EMG
patterns of English speakers MB and NM.
Additionally, the latter two groups differ in the
degree of consonant-related protrusion seen in i-i
utterances.

The look-ahead model, as modified by
Engstrand (1981), might account for these data in
the following way: (1) for English speakers such as
AE and AF, full lip protrusion (i.e., to the degree
found for rounded vowels) is prohibited during one
or more of the intervocalic consonants used in this
study (see footnote 5); (2) for English speakers
such as MB and NM, full lip protrusion for one or
more consonants is required; (3) for Turkish
speakers lip protrusion is compatible with but not
required during these consonants, so that the
degree of lip protrusion seen is dictated by feature
spreading from the segmental context. Thus, for
the English speakers, consonants must have some
phonetic feature specification associated
with protrusion, although this may be either plus
or minus, while for the Turkish speakers
consonants are allowed to have neutral
specification for this feature. It should be noted
that for this version of the look-ahead theory,
because the context-independence of gestures is
not a theme, there is no straightforward
prediction of relationship between the u-u and i-i
word data. It is possible to say, for instance, that
for speakers such as AE and AF lessened
protrusion on consonants is a reaction to a
strongly protruded environment, and the behavior
of the same consonants in an unrounded
environment is irrelevant.

For the coproduction model, in which
articulatory output trajectories are the result of
combining sequences of relatively stable,
independently organized gestures, data from other
contexts such as the i-i words become more
important. In this model, the fact that the central
peak in the u-u word movement traces for English
subjects MB and NM has a counterpart in the i-i
word traces is particularly relevant, as it suggests
that the consonant-related peak in the u-u word
traces may be independent of the gestures for the
flanking vowels. Similarly, the relative lack of
movement in the i-i word traces for speakers AE
and AF suggests that the trough patterns in their
u-u word traces result from combining overlapping
vowel gestures with a small or non-existent
consonant gesture. For the Turkish data, on the
other hand, the lack of movement associated with
the consonant(s) in the i-i word traces, together
with the lack of a trough pattern in the u-u word
traces, means that a different explanation is
called for. According to the coproduction model,
there are several possibilities. First, gestures for
rounded vowels in Turkish (in contrast to those for
English) may simply combine so as to produce a
plateau pattern. This could happen, for instance,
if Turkish gestures were larger or if the
gesture-to-gesture interval were shorter, such that their
overlap results in little or no trough.
Alternatively, Turkish may have a different
algorithm for combining gestures. Finally, the
peculiar phonological properties of vowel harmony
may result in successive rounded segments being
associated with the same protrusion gesture.
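
The first of these possibilities can be illustrated with a toy simulation. It is only a sketch under assumptions that are not part of the study: each vowel gesture is modeled as a Gaussian bell, "larger" is read as "broader in time", overlapping gestures are simply added, and the function names and all numeric values are hypothetical.

```python
import math

def gesture(t_ms, center_ms, width_ms, amplitude=1.0):
    """Hypothetical bell-shaped protrusion gesture (Gaussian shape is an assumption)."""
    return amplitude * math.exp(-0.5 * ((t_ms - center_ms) / width_ms) ** 2)

def relative_trough(interval_ms, width_ms):
    """Dip at the midpoint between two equal gestures, relative to the
    highest point of their summed trace (0.0 means no trough)."""
    def summed(t_ms):
        # Additive combination of the two overlapping vowel gestures.
        return gesture(t_ms, 0, width_ms) + gesture(t_ms, interval_ms, width_ms)
    peak = max(summed(t) for t in range(-400, 801))  # 1 ms sampling grid
    return 1.0 - summed(interval_ms / 2) / peak

# Illustrative parameter values only; none are measurements from the study.
print(relative_trough(300, 80))    # widely spaced, narrow gestures: deep trough
print(relative_trough(150, 80))    # shorter gesture-to-gesture interval: no trough
print(relative_trough(300, 160))   # broader gestures, same interval: no trough
```

Under these assumptions the summed trace has a single peak whenever the interval between the two gestures is no more than twice their width, so either a shorter interval or a broader gesture turns a trough into a plateau-like single peak.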

The differences seen here between Turkish and
English are also interesting in terms of the other
theories mentioned above. For instance, if the
trough in English is assumed to be a marker of
syllable boundary, then the plateau pattern in
Turkish may be taken to indicate that Turkish
does not mark syllable boundaries in this way.
Further, Turkish vowels such as /u/ are
(reportedly) not diphthongized, so that the lack of
a trough in Turkish is compatible with a
diphthongal account of the trough in English.
Note, however, that for these theories the lack of a
trough for English subjects MB and NM is
somewhat problematic. It is necessary to assume
either that the explanation does not apply to all
English speakers or that the specification of
protrusion for the intervening consonants
obscures, in some fashion, the marking of syllable
boundaries or the pattern of diphthongization.

Part II 

Given the data reported here, it is not possible
to test either the look-ahead, the syllable marker,
or the diphthongization theories further. However,
the (phonetic) context-free provision of the
coproduction theory makes it amenable to testing
based on articulatory behavior in different
phonetic contexts. In essence, the logic is as
follows: if articulator trajectories over several
segments reflect the combination of gestures for
each of the segments, then it should be possible to
deduce the basic shape of each gesture from its
behavior in different contexts. It should also be
possible to synthesize articulatory contours by
combining their elements.

Accordingly, this section of the paper describes a
series of tests based on the context-free provision
of the coproduction model. In the first test, the
consonant-related protrusion gestures seen for
English subjects MB and NM in i-i words are
subtracted from corresponding protrusion traces
for u-u words. Success is a function of
correspondence, for the same speaker, between
subtracted traces and other u-u word traces, such
as EMG traces, with no suggestion of consonant
interval protrusion. In other words, because the
coproduction interpretation of inconsistencies
between upper lip, lower lip, and EMG signals for
these speakers involves the presence of an
independent consonant-related protrusion
gesture, removing the additional gesture should
resolve the inconsistencies. In the second test, it is
assumed that, if the vowel- and consonant-related
gestures seen in the corpus are independently
organized, then it should be possible to construct a
viable u-u word from elements in i-u and u-i words.
Thus, the original protrusion traces from i-u and
u-i words are added together and the result
compared to original u-u word traces. Success here
is a function of degree of correspondence between
original u-u word signals and the synthesized
versions. The use of subtraction and addition for
gesture combination is based on data reported by
Löfqvist (1989); Saltzman, Rubin, Goldstein and
Browman (1987); and Saltzman and Munhall (1989).

Both tests require similar intersegment timing
of consonant and vowel gestures in the i-i, u-u and
mixed-vowel words. Although explicit measures of
gestural timing were not made, mean intervocalic
consonant intervals (from acoustic offset of V1 to
acoustic onset of V2) among one-, two- and
three-consonant words varied by less than 35 ms for any
English or Turkish subject. This was taken as
evidence that speech rate and gesture phasing
were similar enough for corresponding gestures to
be equated.
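
As a small illustration of this timing criterion, the check amounts to comparing the largest and smallest mean consonant interval for a subject against the 35 ms figure. The sketch below assumes the mean intervals are already available as numbers; the function name and the example values are hypothetical and are not measurements from the study.

```python
def interval_spread_ms(mean_intervals_ms):
    """Largest minus smallest mean intervocalic consonant interval (ms)."""
    values = list(mean_intervals_ms)
    return max(values) - min(values)

# Made-up mean V1-offset-to-V2-onset intervals (ms) for one subject,
# one entry per word type compared; NOT values from the study.
example_means = {"word type A": 182.0, "word type B": 195.0, "word type C": 210.0}

spread = interval_spread_ms(example_means.values())
print(spread, spread < 35)  # the criterion is met if the spread is under 35 ms
```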

Subtraction Test 

Those i-i word signals showing protrusion in the
consonant interval consisted of upper and lower
lip movement traces for subject MB and lower lip
traces for subject NM. For the first test
(henceforth called the Subtraction Test), these
traces were subtracted, point by point, from
corresponding u-u word movement traces. The
theory behind this procedure was that the
underlying movement during the consonant
interval, i.e., the portion of movement associated
with the vowel gestures, would be the same for
both i-i and u-u words.⁷
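
The point-by-point subtraction can be sketched as follows. This is a minimal illustration, assuming that both averaged traces are sampled on the same time base and aligned at the V1 offset; the function name and the sample values are hypothetical and are not data from the study.

```python
def subtract_traces(uu_trace, ii_trace):
    """Point-by-point difference of two averaged movement traces that share
    a time base and are aligned at the V1 offset."""
    if len(uu_trace) != len(ii_trace):
        raise ValueError("traces must have the same number of samples")
    return [u - i for u, i in zip(uu_trace, ii_trace)]

# Made-up lower lip protrusion samples (mm), for illustration only.
uu = [0.0, 4.0, 7.0, 8.0, 7.0, 4.0, 0.0]  # u-u word: largest peak in the consonant interval
ii = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0]  # i-i word: consonant-related protrusion only

subtracted = subtract_traces(uu, ii)
print(subtracted)  # [0.0, 4.0, 6.0, 3.0, 6.0, 4.0, 0.0]
```

In this constructed example the subtracted trace dips in the consonant interval, which is the trough-like outcome the test looks for when the central peak is due to a separate consonant gesture.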

Figures 12 and 13 show the results of
subtracting averaged /kiktlik/ from averaged
/kuktluk/ movement traces, superimposed on
original /kuktluk/ movement traces for these
subjects. For comparison purposes, Figure 13 also
shows the results of subtracting NM's averaged
upper lip /kiktlik/ movement trace, which showed
no sign of protrusion during the consonant
interval, from her averaged upper lip /kuktluk/
trace.

All four subtracted traces in Figures 12 and 13 show a
trough pattern. This is to be expected for the
upper lip trace of subject NM, since her original
/kuktluk/ traces showed a trough and her /kiktlik/
trace is essentially flat. It is striking, however,
that the trough patterns for both subjects' lower
lip traces, and for MB's upper lip trace, correspond
more neatly to these subjects' EMG trough
patterns (seen in Figures 8 and 9) than did the
original u-u traces. Further, NM's subtracted lower
lip trace is nearly identical to both her subtracted
and original upper lip traces.

This result supports the hypothesis that
subjects NM and MB have separate vowel- and
consonant-related behavior for protrusion.
Further, it suggests that their vowel-related
protrusion behavior (presumably connected to
articulatory instantiation of the vowel feature of
rounding) resembles that of other English
speakers in being trough-like. The presence of lip
protrusion during consonant articulation suggests,
not a different articulatory organization, but an
additional gesture overlapping with vowel-related
gestures. At a more general level, this result can
be taken as support for the coproduction model
notion that gestures are independent entities and
for the notion that gesture combination is
approximately additive.

Addition Test 

For the second test (henceforth known as the
Addition Test), averaged upper and lower lip
movement i-u and u-i word traces were added
together for each English and Turkish subject.
Because the result of adding i-u and u-i word
traces is theoretically equal to the result of adding
u-u and i-i word traces, the averaged i-i word traces
were then subtracted from each added trace⁸ to
produce a 'constructed' u-u trace. Figures 14-21
show the results of this procedure for lower lip
data from /kuktluk/, /kiktlik/, /kuktlik/ and
/kiktluk/ (upper lip data are substantially the
same). The top panels show averaged /kuktlik/
and /kiktluk/. The traces resulting from adding
these and subtracting /kiktlik/ (henceforth known
as constructed traces) are shown in the bottom
panels together with superimposed original u-u
traces.
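
The logic of the construction can be made explicit. Writing U1, U2 for the two /u/ gestures, I1, I2 for the two /i/ gestures, and C for the consonant gesture (hypothetical symbols, not used in the study), strict additivity gives i-u + u-i = (I1 + C + U2) + (U1 + C + I2) = (U1 + C + U2) + (I1 + C + I2) = u-u + i-i, so that u-u = i-u + u-i - i-i. The sketch below applies this to made-up component gestures. The function names and values are assumptions, and the RMS difference is only one possible way to quantify correspondence; the study does not specify a numerical measure.

```python
import math

def add_traces(*traces):
    """Sample-by-sample sum of equally long traces (additive combination)."""
    return [sum(samples) for samples in zip(*traces)]

def construct_uu(iu_trace, ui_trace, ii_trace):
    """Constructed u-u trace: (i-u + u-i) - i-i, point by point."""
    if not (len(iu_trace) == len(ui_trace) == len(ii_trace)):
        raise ValueError("traces must have the same number of samples")
    return [iu + ui - ii for iu, ui, ii in zip(iu_trace, ui_trace, ii_trace)]

def rms_difference(trace_a, trace_b):
    """Root-mean-square difference between two equally long traces."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(trace_a, trace_b)) / len(trace_a))

# Made-up component gestures (arbitrary units), for illustration only.
U1 = [0, 4, 5, 2, 0, 0, 0]    # /u/ gesture for V1
U2 = [0, 0, 0, 2, 5, 4, 0]    # /u/ gesture for V2
I1 = [0, -1, -1, 0, 0, 0, 0]  # slight retraction for /i/ as V1
I2 = [0, 0, 0, 0, -1, -1, 0]  # slight retraction for /i/ as V2
C = [0, 0, 1, 2, 1, 0, 0]     # consonant-related protrusion

uu = add_traces(U1, C, U2)
iu = add_traces(I1, C, U2)
ui = add_traces(U1, C, I2)
ii = add_traces(I1, C, I2)

constructed = construct_uu(iu, ui, ii)
print(constructed == uu, rms_difference(constructed, uu))
```

When the component gestures are strictly context-independent and additive, as in this example, the constructed and original traces coincide; any mismatch in real data therefore points to a departure from one of those assumptions or from the assumed timing equivalence.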

As these figures show, the constructed traces
paralleled the original traces quite closely for
three out of four English subjects, and for two out
of the four Turkish subjects. For English subjects
MB and AE, in particular, the traces parallel one
another quite closely. For Turkish subject CK the
principal difference is the slightly lower amplitude
of the constructed trace.⁹ For English subject NM,
differences are also minimal. For Turkish subject
EG, differences are intensification of a slight
"bump" existing in the original trace plus a
slightly lowered amplitude of movement during
the final /u/ vowel. For the remaining subjects,
however, differences are more serious. English
subject AF's constructed trace shows a single
broad peak rather than a trough as in the original
trace. In contrast, Turkish subject IB's
constructed trace shows a trough rather than a
plateau as in the original trace. Turkish subject
AT's constructed trace, while paralleling the
original trace during the final vowel, has a
generally different shape from the plateau pattern
of the original trace.

Figure 12. Original averaged /kuktluk/ protrusion traces
(solid lines) with superimposed trace achieved by
subtracting original averaged /kiktlik/ trace from
original averaged /kuktluk/ trace (dashed line), for
English subject MB. Upper panel shows original and
subtracted traces for the upper lip, lower panel shows
the same for the lower lip. The vertical line is the V1
offset. Vertical and horizontal scales are as in Figure 8.

Figure 13. Original averaged /kuktluk/ protrusion traces
(solid line) with superimposed trace made by
subtracting original averaged /kiktlik/ trace from
original averaged /kuktluk/ trace (dashed line), for
English subject NM. Upper panel shows original and
subtracted traces for the upper lip, lower panel shows
the same for the lower lip. The vertical line is the V1
offset. Vertical and horizontal scales are as in Figure 9.






Figure 14. Upper panel shows averaged lower lip
protrusion traces for /kuktlik/ and /kiktluk/ as produced
by Turkish subject AT. Lower panel shows original
averaged protrusion trace for /kuktluk/ (solid line) with
superimposed trace constructed by adding averaged
traces for /kuktlik/ and /kiktluk/ and subtracting
averaged trace for /kiktlik/ (dashed line). The vertical
line is the V1 offset. Vertical and horizontal scales are as
in Figure 2.

Figure 15. Upper panel shows averaged lower lip
protrusion traces for /kuktlik/ and /kiktluk/ as produced
by Turkish subject IB. Lower panel shows original
averaged protrusion trace for /kuktluk/ (solid line) with
superimposed trace constructed by adding averaged
traces for /kuktlik/ and /kiktluk/ and subtracting
averaged trace for /kiktlik/ (dashed line). The vertical
line is the V1 offset. Vertical and horizontal scales are as
in Figure 3.

Figure 16. Upper panel shows averaged lower lip
protrusion traces for /kuktlik/ and /kiktluk/ as produced
by Turkish subject EG. Lower panel shows original
averaged protrusion trace for /kuktluk/ (solid line) with
superimposed trace constructed by adding averaged
traces for /kuktlik/ and /kiktluk/ and subtracting
averaged trace for /kiktlik/ (dashed line). The vertical
line is the V1 offset. Vertical and horizontal scales are as
in Figure 4.

Figure 17. Upper panel shows averaged lower lip
protrusion traces for /kuktlik/ and /kiktluk/ as produced
by Turkish subject CK. Lower panel shows original
averaged protrusion trace for /kuktluk/ (solid line) with
superimposed trace constructed by adding averaged
traces for /kuktlik/ and /kiktluk/ and subtracting
averaged trace for /kiktlik/ (dashed line). The vertical
line is the V1 offset. Vertical and horizontal scales are as
in Figure 5.

Figure 18. Upper panel shows averaged lower lip
protrusion traces for /kuktlik/ and /kiktluk/ as produced
by English subject AE. Lower panel shows original
averaged protrusion trace for /kuktluk/ (solid line) with
superimposed trace constructed by adding averaged
traces for /kuktlik/ and /kiktluk/ and subtracting
averaged trace for /kiktlik/ (dashed line). The vertical
line is the V1 offset. Vertical and horizontal scales are as
in Figure 6.

Figure 19. Upper panel shows averaged lower lip
protrusion traces for /kuktlik/ and /kiktluk/ as produced
by English subject AF. Lower panel shows original
averaged protrusion trace for /kuktluk/ (solid line) with
superimposed trace constructed by adding averaged
traces for /kuktlik/ and /kiktluk/ and subtracting
averaged trace for /kiktlik/ (dashed line). The vertical
line is the V1 offset. Vertical and horizontal scales are as
in Figure 7.

Figure 20. Upper panel shows averaged lower lip
protrusion traces for /kuktlik/ and /kiktluk/ as produced
by English subject MB. Lower panel shows original
averaged protrusion trace for /kuktluk/ (solid line) with
superimposed trace constructed by adding averaged
traces for /kuktlik/ and /kiktluk/ and subtracting
averaged trace for /kiktlik/ (dashed line). The vertical
line is the V1 offset. Vertical and horizontal scales are as
in Figure 8.

Figure 21. Upper panel shows averaged lower lip
protrusion traces for /kuktlik/ and /kiktluk/ as produced
by English subject NM. Lower panel shows original
averaged protrusion trace for /kuktluk/ (solid line) with
superimposed trace constructed by adding averaged
traces for /kuktlik/ and /kiktluk/ and subtracting
averaged trace for /kiktlik/ (dashed line). The vertical
line is the V1 offset. Vertical and horizontal scales are as
in Figure 9.

DISCUSSION

The results of the Subtraction Test constitute
relatively strong evidence for the generalization
that English speakers produce troughs for words
such as /kuktluk/, and for the notion that gestures
are independent entities whose trajectories
combine when overlapped in time. The Subtraction
Test results also suggest that additivity is at least
a reasonable approximation of the way that
gestures combine for these articulators and these
segments.

The results of the Addition Test are less clear.
While the predicted and actual trajectories were
close for some subjects, for other subjects they
were qualitatively different. While the results
were slightly better for English subjects than for
Turkish subjects, the distinction between three
subjects out of four (for English) vs. two subjects
out of four, or even one out of four (for Turkish), is
hardly great enough to warrant concluding the
two languages are different. It is also not clear
how to interpret a lack of correspondence between
constructed and original traces; for instance, the
assumption of similar conditions of speech rate,
stress and gesture phasing between averaged i-u,
u-i, u-u and i-i words may not be accurate. The fact
that in Turkish i-u and u-i words are non-harmonic
is also relevant. It is possible, for instance, that /u/
and /i/ vowels in Turkish words are always
produced with independently organized gestures,
but that these gestures are different in harmonic
and non-harmonic words. A fuller discussion of
these issues can be found in Boyce (1988).

General Discussion 

Overall, the results of this study suggest there is
something very different in the way English and
Turkish speakers organize articulation, at least in
the way they use lip protrusion for rounded
segments. The simplest index of this difference is
the plateau pattern of protrusion evinced by the
Turkish speakers, which contrasts with the
English patterns found here, and with the trough
patterns reported in the literature to date. The
results for English subjects, and in particular the
results of the Subtraction Test for subjects MB and
NM, confirm that the underlying articulatory
strategy for u-u words in English follows a trough
pattern.

With regard to the competing coproduction and
look-ahead types of models, interpretation of these
results is both straightforward and complex. The
straightforward interpretation is as follows. Since
the coproduction model predicts a trough pattern
in u-u words, and English shows a trough pattern,
English speakers employ a coproduction
articulatory strategy. Since the look-ahead model
predicts a plateau pattern in u-u words, and
Turkish shows a plateau pattern, Turkish
speakers employ a look-ahead strategy. Thus,
English and Turkish have different articulatory
strategies.

This interpretation gains strength from the fact
that, for each model, explaining the patterns of
both English and Turkish requires an additional
mechanism. To explain the English trough pattern
the look-ahead model must posit additional effects
such as syllable-boundary marking,
diphthongization, or consonant-specific unrounding in a
rounded context. Similarly, to explain the Turkish
plateau pattern the coproduction model must posit
an unknown effect that causes i-u and u-i vowel
gestures to differ from those in u-u words, or an
unknown principle of gesture combination, or a
loosening of the notion that gestures may be
associated with only one segment. While any of
these posited effects may ultimately prove to be
valid, their status at this stage of investigation
appears to be weak.

The complexity of this interpretation lies in the
conclusion that different languages may employ
different articulatory strategies. In some sense,
this is to be expected, since the combination of
phonology, lexicon and syntax in different
languages may pose entirely different
challenges to articulatory efficiency. In fact, the
hypothesis behind this comparison of Turkish and
English was the notion that, in contrast to
English, Turkish provides ideal conditions for
articulatory look-ahead. At the same time, human
beings presumably come to the task of language
acquisition with the same tools and talents. The
finding that current models of coarticulation are
insufficient to account for language diversity
indicates that we have not yet penetrated to the
universal level in the way we think about speech
production. Further research, and in particular
more cross-linguistic research, is needed in order
to close this gap.

FOOTNOTES

*To appear in Journal of the Acoustical Society of America.
†Speech Communication Group, Research Laboratory of
Electronics, Massachusetts Institute of Technology.

1 The models tend to differ in the level at which compatibility is
assessed. In Henke's program, the limitation was explicitly
defined on an articulatory basis. Other investigators (Cohn,
1988; Keating, 1988) have postulated that coarticulation spreads
by reference to feature specification at the phonological level.

2 Rotational head movement could theoretically affect the results
in two ways. First, the LEDs might have been moved into a
more peripheral area of the focus field where tracking is less
accurate. Subjects were constantly monitored against this
possibility during the course of the experiment, and videotaped
experiments were checked post-hoc. Second, rotation of the
head changes the relationship between the vertical and
horizontal axes of the LED tracking system (the X and Y
coordinates) and the subjects' sagittal midline. Thus, less or
more of the subjects' anterior-posterior movement relative to
the midline may be detected. Note, however, that in all cases
where a subject's amplitude of movement changed over the
course of the experiment this change was mirrored in the
corresponding EMG signal, suggesting that baseline change
was not a significant factor.

3 Interestingly, for subjects IB and CK, the EMG pattern for u-u
words resembled that for u-i words. The pattern for i-u words
showed a strong peak associated with V2. For subject EG, on
the other hand, the pattern for /kuktluk/ resembled that for
/kiktluk/, while the patterns for shorter words /kukuk/,
/kuluk/, etc. resembled those for /kukik/, /kulik/, etc. EMG
traces for these words are reported in Boyce (1988).

4 To some extent this difference between OOS and OOI signals is
a consequence of averaging, as token traces for the two signals
showed differing proportions of double- and triple-peaked
patterns.

5 Perusal of the traces for /kitik/, /kilik/ and /kikik/ suggests
that NM shows some lower lip protrusion for each intervocalic
consonant. For MB the lower lip trace shows marked
protrusion for /t/ and some protrusion for /l/, while the
upper lip shows protrusion only for /l/. Each of these
protrusion peaks matches the consonant-interval peak of the
corresponding u-u word.

6 Other comparisons of JX, JY, LLX and LLY signals, as well as
observations of the videotapes, suggest that these speakers use
forward movement of the jaw (both rotational and
translational) as well as lip movement to produce protrusion
during rounded vowels. This topic is discussed further in
Boyce (1988).

7 The intractable problem with this procedure is that /i/ and
/u/ vowels also should have characteristic patterns. Thus, if
retraction for /i/ vowels is present, it will be subtracted from
protrusion for /u/ vowels at the same time the consonant
protrusion is subtracted. In these data, the relative magnitude
of protrusion dwarfed that of retraction. Thus, it was assumed
that the effects of subtracting the one outweighed the effects of
the other.

8 As in the first test, some residue of possible /i/
vowel-associated movement remains in these constructed traces.

9 Note that the theory of independent gestures does not require
that all gestures have identical amplitude or be produced with
identical force. Such a requirement would leave no room for
the effect of fatigue or for prosodic variables such as stress and
syllable position. It is a common observation, for instance, that
EMG or movement signals for the same word may show less
amplitude at later stages during the same experiment. In this
study, the fact that u-u, i-u, u-i and i-i utterances were blocked
separately may have caused some differences in overall
amplitude among them.

APPENDIX
Turkish has eight vowels /ɯ i a e o ø u y/ and
thus (like Swedish but unlike English) has vowels
which contrast only in rounding. The consonants
/t/, /k/ and /l/ are non-labial and phonemically
unrounded in both languages (Ladefoged, 1975;
Lewis, 1967). English and Turkish have somewhat
different patterns of allophonic variation for /k/
and /l/. In Turkish /k/ and /l/ tend to be front or
back according to the front/backness of the vowel
of the same syllable (Clements & Sezer, 1982). In
contrast, for most English dialects syllable-initial
/l/ is front and syllable-final /l/ is back, i.e.
velarised (Keating, 1985; Ladefoged, 1975), while
/k/ varies primarily in syllable-initial position,
becoming front before front vowels and back
before back vowels. (Although it is sometimes
referred to as "consonant harmony," the Turkish
rule for /l/ and /k/ is distinct from that for
front/back harmony in vowels.) Neither /l/ nor /k/
participates in roundness harmony. The sequence
/ktl/ is rare in both languages, but exists, cf.
English tactless and Turkish /paktlar/ "pacts."
Neither language allows the initial cluster /tl/;
therefore, /kuktluk/ would have the syllable
structure /kukt-luk/ in both languages.

Long Range Coarticulatory Effects for Tongue Dorsum
Contact in VCVCV Sequences*

Daniel Recasens†



The goal of this paper is to gather accurate information about the temporal and spatial
properties of tongue dorsum movement in running speech. Electropalatographic and
acoustical data were collected to measure lingual coarticulation over time. Coarticulatory
effects were measured along VC[ə]CV utterances for articulations differing in the degree of
tongue dorsum contact, namely, for vowels [i] vs. [a] and for consonants [ʃ] vs. [t]; the
contextual phonemes were all possible combinations of those same consonants and vowels.
Results show contrasting mechanisms for anticipatory and carryover coarticulation.
Anticipatory effects appear to be more tightly controlled than carryover effects, presumably
because of phonemic preplanning; accordingly, gestural antagonism in the contextual
phonemes affects the two coarticulatory types differently. The relevance of these data with
respect to theories of coarticulation and speech production modeling is discussed.



I. INTRODUCTION

A large body of experimental evidence in the
phonetics literature shows that the articulatory
gestures for successive phonemic units are
coarticulated in running speech and thus overlap
in time. Many studies of coarticulation aim at
reaching some understanding about the
mechanisms of phonemic realization in speech
production. A relevant goal of this research is to
separate articulatory events resulting from motor
programming strategies from others related to
the mechanical and physical properties of the
speech production system. This paper approaches
the nature of these two components through an
analysis of tongue dorsum coarticulation over
time, under the theoretical assumption that the
phonemic gestures are programmed to exhibit
certain temporal and spatial patterns, and that
those patterns prevail to a large extent across
changes in phonemic context, speech rate and
speaker (Harris, 1984; MacNeilage & DeClerk,
1969).

The research to be reported here is relevant to a
number of significant issues. 

A. Coarticulation at a distance 

One of the issues of interest in coarticulation 
studies is the temporal domain of gestural activity
for a given phonemic unit. Early research
suggested that jaw (Gay, 1977; Sussman et al.,
1973) and lingual activity (see references in
section I.C) did not extend more than one or two
segments beyond the target phoneme; more recent
acoustic evidence shows, however, that such
coarticulatory effects may last for three or four
phonemes. Thus, long range temporal effects have
been found for American English in [VtəlV] (V1-to-V3
effects; Huffman, 1986) and in [VbəbV] (V3-to-V1
effects; Magen, 1989) sequences. The present
paper intends to replicate these findings through
an analysis of coarticulation at a distance in
VCVCV utterances. Analogously to Huffman's and
Magen's studies, V2 was kept constant as [ə] to
facilitate long range effects over time, since the
schwa is highly sensitive to coarticulation from
the adjacent segments (Recasens, 1985). Another
goal of this research is to find out whether long
range coarticulatory effects occur in other
languages besides English.

B. V- and C-dependent coarticulation

Evidence for transconsonantal V-to-V 
coarticulation, and for greater V-to-C effects than 
C-to-V effects (e.g., MacNeilage & DeClerk, 1969), 
suggests that consonantal gestures unspecified for
tongue-dorsum activity (e.g., labials and 
dentoalveolars) are produced with reference to a 
continuous V-to-V cycle (Fowler, 1984; Ohman, 
1966). The implication of this view is that the 
domain of consonant related coarticulation is more 
constrained than the domain of vowel related 
coarticulation. V-to-V effects ought to be larger 
than C-to-C effects since the phonemic string is 
organized according to underlying gestures of a 
diphthongal nature. 

To test this theory I will study coarticulatory
effects from vowels and consonants differing in
degree of linguopalatal contact along VC[ə]CV
sequences. Thus, C-dependent effects will be
analysed for the palatoalveolar consonant [ʃ] vs.
the apicodental or apicoalveolar consonant [t], and
V-dependent effects for the palatal vowel [i] vs.
the non-palatal vowel [a]. If the hypothesis is
correct, C-dependent effects ought not to extend
beyond V2=[ə]. Notice however that since the
schwa has been denied a vocal tract target of its
own (Catford, 1977) the availability of C-to-C
effects across [ə] would not invalidate a strong
version of the theory. However, it would be critical
for the theory if the temporal domain of the C-dependent
effects exceeded that of the V-dependent
effects.

C. Gestural antagonism

Theories of coarticulation agree that gestural 
activity corresponding to phonemic units does not 
take place across antagonistic gestures (see 
section I.D). The problem is that no clear
formulation of what is meant by gestural
antagonism is available in most cases (Bell-Berti
& Harris, 1981; Fowler, 1984). It is proposed in
this paper that the degree of tongue dorsum
coarticulation is inversely related to the
involvement of the dorsum of the tongue in
making a closure or a constriction. Data for
consonants and vowels in the literature offer some 
support for this hypothesis. 

Anticipatory V-to-V effects for tongue dorsum 
activity in VCV utterances have been found by 
Butcher and Weiher (1976), and Alfonso and Baer 
(1982), but much less so or not at all by Carney 
and Moll (1971), Gay (1977), Parush et al. (1983), 
and Farnetani et al. (1985). The explanation
underlying these contrasting findings lies partly
in the articulatory characteristics of the
intervocalic consonant. Thus, V-to-V effects occur
mainly for consonants for which the dorsum of the
tongue is not involved in the making of a closure
or a constriction ([p]: Alfonso and Baer (1982); [t]:
Butcher and Weiher (1976)). Velars, on the other
hand, were found to block V-to-V effects to a much
larger extent than labials and dentoalveolars in
some of the previous studies. I have shown in this
respect that the degree of transconsonantal V-to-V
coarticulation is related inversely and
monotonically to the degree of tongue-dorsum 
contact, for more palatal-like vs. more alveolar- 
like consonants (Recasens, 1984). 

The articulatory characteristics of vowels also 
affect the extent of V-to-V coarticulation over
time: like palatal consonants, palatal vowels are
more resistant to tongue dorsum coarticulation
than vowels showing no constriction at the palatal
place of articulation. Thus, smaller V-to-V effects
on [i] than on [a] have been reported by Gay 
(1977), Butcher and Weiher (1976), and Recasens 
(1984). 

This study will look into the extent to which
coarticulation is blocked during the production of
consonants and vowels involving different degrees
of palatal contact, namely, [ʃ] and [t], and [i] and
[a]. The hypothesis that coarticulation is inversely
related to the degree of tongue dorsum contact for
the contextual gestures will be tested across
adjacent and distant phonetic segments; for that
purpose, all possible combinations of consonants
[ʃ] and [t], and vowels [i] and [a] in VC[ə]CV
sequences will be submitted to experimental
analysis.

D. Anticipatory vs. carryover
coarticulation

Another issue of interest in the present study is 
the nature of the anticipatory and the carryover 
effects. 

The sequencing of these two coarticulation types 
with respect to the target phoneme suggests that 
the anticipatory effects ought to reflect phonemic 
preplanning, and that the carryover effects ought
to be mainly determined by the articulatory
properties of the target and contextual phonemes.
However, as shown in section I.C, the anticipation 
of tongue dorsum activity is not independent of 
the production characteristics of the preceding 
phonemes. Therefore, possible candidates for 
preplanning mechanisms would be those 
regularities in onset time of articulatory activity 
that can be shown to depend minimally on the
articulatory properties of the preceding phonemic
string.

Theories exist about what those regularities
may be. They have been devised, however, to
account mostly for velar and labial activity, and
not so much for tongue movement (Sussman &
Westbury, 1981). Moreover, they disagree as to
whether anticipatory effects may affect an
unlimited number of non-antagonistic gestures
(Henke, 1966), or whether they are locked in time
to the target gesture provided that no articulatory
conflict is involved (Bell-Berti & Harris, 1981).
Further refinements need to be carried out on 
theories of coarticulation to accommodate 
differences in coarticulatory activity for different 
articulators as well as differences in gestural 
antagonism associated with the contextual 
phonemes. 

Some findings reported in coarticulation studies 
are relevant with respect to the contrasting
nature of anticipatory vs. carryover effects for
tongue dorsum activity: (a) anticipatory effects are
essentially temporal and their onset shows little
variability, while carryover effects are essentially
spatial and more variable (Gay, 1977; Parush et
al., 1983); (b) carryover effects may be larger than
anticipatory effects (Farnetani et al., 1985;
Recasens, 1984) but also smaller (Butcher & 
Weiher, 1976). Acoustic evidence for larger 
carryover vs. anticipatory effects (Fowler, 1981; 
Huffman, 1986) and vice versa (Magen, 1989) is 
also available. 

One can argue that while the findings reported 
in (a) result from the preplanned vs. mechanical 
nature of anticipatory vs. carryover coarticulation,
respectively, those reported in (b) are dependent 
on differences in speech rate, speaker and 
phonemic context. Thus, higher speech rates are 
likely to bring about an increase in the amount of 
anticipatory coarticulation while reducing the 
degree of antagonism from the preceding gestures; 
on the other hand, slower speech rates 
presumably cause the mechanical constraints for 
the preceding gestures to overcome those 
anticipatory effects associated with the 
preplanning of the target phoneme. Therefore one 
might plausibly argue for a progressive increase of 
anticipatory vs. carryover effects as speech rate 
increases, and of carryover vs. anticipatory effects
as speech rate decreases. Similarly, the relative
salience of the anticipatory vs. carryover effects
may depend on the degree of articulatory
constraint associated with the gestures preceding
and following the target phoneme: anticipatory
effects are likely to prevail upon carryover effects
when the phonetic segments following the target
phoneme involve higher gestural requirements
than those preceding it; on the other hand,
carryover effects should be larger than 
anticipatory effects if the phonetic segments 
preceding the phonemic target are more 
constrained than those following it. 

Within this framework, the present study is
concerned with the nature and the relative
salience of the anticipatory vs. carryover modes of
coarticulation across segments showing different 
degrees of coarticulatory resistance. 

II. METHOD

A. Articulatory analysis

Electropalatography (EPG) was used to analyze
tongue dorsum contact over time.
Electropalatographic data were collected for all
possible VC[ə]CV combinations with C=[ʃ], [t] and
V=[i], [a], and stress on the last syllable; all
sequences were embedded in a [p__p] carrier
environment.

Subjects read the sequences listed in
Table 1. Two assumptions underlie the ordering of
those sequences. In the first place, the degree of
contextual interference with respect to
coarticulatory effects ought to decrease as we
move away from the target phoneme. Thus, e.g.,
V1-dependent effects ought to be more heavily
influenced by the articulatory characteristics of
C1 than by those of the phonetic segments placed
at the other side of [ə] (i.e., C2 and V3). Secondly,
phonetic segments involving more palatal contact
(i.e., [ʃ] and [i]) ought to conflict with
coarticulatory effects to a larger extent than those
involving a lesser degree of palatal contact (i.e., [t]
and [a]).



Table 1. List of sequences used in the experiment.
Stress was placed on the last syllable.

 1. piʃəʃip     5. piʃətip     9. pitəʃip    13. pitətip
 2. paʃəʃip     6. paʃətip    10. patəʃip    14. patətip
 3. piʃəʃap     7. piʃətap    11. pitəʃap    15. pitətap
 4. paʃəʃap     8. paʃətap    12. patəʃap    16. patətap



A composite account of the two assumptions 
explains the particular ordering of the utterances 
in Table 1. Thus, the utterances in the table are 
ordered for decreasing degrees of resistance to 
carryover coarticulation associated with V1. In the
first place, the offset of V1-dependent effects for [i]
vs. [a] is expected to occur earlier for sequences 1
through 8 than for sequences 9 through 16 since
C1=[ʃ] in the first set of utterances and C1=[t] in
the second. Within each set of sequences with a
different C1, V1-dependent carryover effects ought
to last for a shorter time when C2=[ʃ] than when
C2=[t] (sequences 1 through 4 vs. 5 through 8, and
9 through 12 vs. 13 through 16). Finally, within
each set of sequences showing a different C1 and a
different C2, V1-dependent carryover effects ought
to last less long when V3=[i] than when V3=[a]
(sequences 1 and 2 vs. 3 and 4, 5 and 6 vs. 7 and 8, 
and so on). 
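
The ordering described above can be sketched as a nested enumeration. The following is only an illustrative sketch: the function name is hypothetical, and the alternation of V1 between [i] and [a] within each pair of sequences is an assumption inferred from the references to sequences 1, 4, 13 and 16 as the symmetrical ones.

```python
from itertools import product

# Segments ordered from more to less palatal contact, i.e., from more to
# less expected resistance to V1-dependent carryover coarticulation.
CONSONANTS = ["ʃ", "t"]
V3_VOWELS = ["i", "a"]
# Assumption: V1 alternates [i], [a] within each pair of sequences.
V1_VOWELS = ["i", "a"]


def build_sequences():
    """Return the 16 carrier-embedded sequences in the Table 1 order.

    C1 is the slowest-varying factor, then C2, then V3, then V1.
    """
    sequences = []
    for c1, c2, v3, v1 in product(CONSONANTS, CONSONANTS, V3_VOWELS, V1_VOWELS):
        sequences.append(f"p{v1}{c1}ə{c2}{v3}p")
    return sequences


if __name__ == "__main__":
    for number, sequence in enumerate(build_sequences(), start=1):
        print(number, sequence)
```

Because `itertools.product` varies its last argument fastest, placing C1 first and V1 last reproduces the grouping into sequences 1 through 8 vs. 9 through 16, and so on down to the pairs.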

A Catalan speaker from the Barcelona region 
(Re), and two American English speakers from
New York City (Ra, Ba), repeated all utterances
ten times with the artificial palate in place.
Simultaneous recordings were made of the EPG
and the acoustic signals. The American English
speakers were asked to avoid flapping of phonemic
/t/ and to make an alveolar stop instead; phonetic 
perception and inspection of the EPG data 
revealed that the consonant was always [t]. 

The electropalatographic system (Rion 
Electropalatograph model DP-01) has been 
described elsewhere (Recasens, 1984; Shibata et
al., 1978). It is equipped with 63 electrodes
arranged in five semicircular rows (see Figure 1)
and allows displaying the pattern of contact every
15.6 ms. As shown in the figure, the electrodes can
be grouped in four articulatory regions (i.e.,
alveolar, prepalatal, mediopalatal and
postpalatal) for data analysis. The figure also
shows that some electrodes are located along a
median line; since electrodes on this line belong
neither to the right nor to the left side of the 
palate, they were assigned to both sides when 
necessary. One can see that the number of 
electrodes decreases as rows become more central; 
thus, the outermost row has 8.5 electrodes on each 
side and the innermost row has 3.5. 



B. Acoustical analysis 

The acoustical data were digitized at a sampling 
rate of 10 kHz, after pre-emphasis and low-pass
filtering. An LPC program included in the ILS
(Interactive Laboratory System) package was
available for spectral analysis. F2 data were
collected at the measurement points of interest
and averaged across repetitions. F2
measurements were preferred to other
measurements (e.g., F3 and/or F1) since, as shown
in the literature, there is a good correlation
between F2 and tongue placement for vowels (e.g.,
Alfonso & Baer, 1982; Fant, 1960). 

C. Measurement points in time

The points in time selected for analysis are 
shown in Table 2 for the EPG and the F2 data. 

Table 2. Measurement points in time for the EPG data
(above) and the F2 data (below).

EPG data:
1. V1 midp.
2. V1 offs.
3. C1 midp.
4. V2 ons.
5. V2 midp.
6. V2 offs.
7. C2 midp.
8. V3 ons.
9. V3 midp.

F2 data:
1. V1 midp.
2. V1 midp./offs.
3. V2 midp.
4. V3 ons./midp.
5. V3 midp.




Figure 1. Electropalate. Region labels: alveolar, prepalatal, mediopalatal, postpalatal.



Measurement points for the EPG data were 
labeled on the acoustic waveform except for the [t] 
midpoint (see Figure 2). This was due to the fact 
that, except for the [t] closure, the articulatory
events of interest could not be located
satisfactorily on the EPG record; thus, for
example, it was not possible to select the frame
corresponding to the period of maximum
postalveolar constriction for [ʃ] (especially
when adjacent to the vowel [i], as exemplified in
Figure 2). Vowel onsets and offsets were labeled at
onsets and offsets of voicing; moreover, those low
amplitude pitch pulses occurring after the onset of
lingual closure for [t], as determined visually on
the acoustic waveform, were excluded from the
vocalic period in the measurement procedure.
Vowel midpoints were labeled halfway between
the vowel onsets and offsets. The [ʃ] midpoint was
located at the midpoint between the voicing offset 
for the preceding vowel and the voicing onset for 
the following vowel. Acoustic waveforms for all 
utterances were labeled at the points in time 
indicated in Table 2 and the EPG frames 
corresponding to those acoustic labels were used 
for data analysis. 
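
The line-up between acoustic labels and EPG frames can be illustrated with a short sketch. It assumes, hypothetically, that label times are expressed in milliseconds from the start of the EPG record and that frame k covers the interval from k × 15.6 ms up to (but not including) (k + 1) × 15.6 ms; the function names are illustrative and do not come from the original procedure.

```python
FRAME_PERIOD_MS = 15.6  # one EPG contact pattern every 15.6 ms


def midpoint(onset_ms, offset_ms):
    """Halfway point between two labels (e.g., vowel voicing onset and offset)."""
    return (onset_ms + offset_ms) / 2.0


def epg_frame_index(label_ms, frame_period_ms=FRAME_PERIOD_MS):
    """Index of the EPG frame whose interval contains the acoustic label.

    Assumption: frame k spans [k * period, (k + 1) * period).
    """
    if label_ms < 0:
        raise ValueError("label time must be non-negative")
    return int(label_ms // frame_period_ms)


if __name__ == "__main__":
    # Hypothetical labels: a vowel voiced from 100 ms to 200 ms.
    vowel_mid = midpoint(100.0, 200.0)
    print(vowel_mid, epg_frame_index(vowel_mid))
```

The same `midpoint` helper would serve for the [ʃ] midpoint, taken between the voicing offset of the preceding vowel and the voicing onset of the following vowel.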

The consonantal midpoint for [t] was established
at the closure midpoint. Since the closure period
for this consonant is easy to determine on the
EPG record it was decided to use an articulatory
rather than an acoustic criterion in this case.
Thus, the [t] midpoint was the only frame or the
medial frame of several successive frames showing
all electrodes lighted up on row 1; when this is the
case, maximal contact at the front of the alveolar
region is achieved. 
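
The articulatory criterion for the [t] midpoint can be sketched as follows. This is a minimal illustration under stated assumptions: each frame is reduced to a list of 0/1 states for the row 1 electrodes, only the first run of fully contacted frames is considered, and for a run of even length the earlier of the two central frames is returned (the text does not say how such ties were resolved).

```python
def t_midpoint_frame(row1_frames):
    """Return the index of the medial frame of the [t] closure, or None.

    row1_frames: one list of 0/1 electrode states (row 1 only) per EPG frame.
    A frame belongs to the closure when every row 1 electrode is "on".
    """
    start = None
    end = None
    for index, row in enumerate(row1_frames):
        if all(row):
            if start is None:
                start = index
            end = index
        elif start is not None:
            break  # the first closure run has ended
    if start is None:
        return None
    # Assumption: for an even number of frames, take the earlier central one.
    return (start + end) // 2


if __name__ == "__main__":
    frames = [[1, 0, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [0, 1, 1]]
    print(t_midpoint_frame(frames))
```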



Acoustical analysis (F2 measurements) was 
performed at different points along the vowels, as 
shown in Table 2. The main purpose of the 
acoustical analysis was to supplement the EPG 
data and, in particular, to obtain more 
information about coarticulatory effects at the 
midpoint of [a] since linguopalatal contact for this 
vowel at this particular point in time was 
practically absent for two of the three speakers 
(see Figure 5). The location of the vowel's 
midpoint is the same as that used in the EPG 
analysis. A single point along V2=[ə] was chosen
for analysis (i.e., V2 midpoint). The V1 offset and
the V3 onset labels were not placed at the voicing
endpoints since the frequency of the F2 peak
shows a good amount of variability at these points
in time; instead, F2 measurements were taken at
equidistant points between the V1 midpoint and
the V1 offset (i.e., V1 midp./offs.), and between the
V3 onset and the V3 midpoint (i.e., V3 ons./midp.).
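
The five F2 measurement points follow directly from the vowel onset and offset labels. The sketch below assumes label times in milliseconds and interprets "equidistant" as the halfway point between the two labels concerned; the function name and dictionary keys are illustrative only.

```python
def f2_measurement_times(v1_on, v1_off, v2_on, v2_off, v3_on, v3_off):
    """Return the five F2 measurement times computed from vowel labels.

    All arguments are voicing onset/offset times in ms for V1, V2 and V3.
    """
    v1_mid = (v1_on + v1_off) / 2.0
    v2_mid = (v2_on + v2_off) / 2.0
    v3_mid = (v3_on + v3_off) / 2.0
    return {
        "V1 midp.": v1_mid,
        # Halfway between the V1 midpoint and the V1 offset.
        "V1 midp./offs.": (v1_mid + v1_off) / 2.0,
        "V2 midp.": v2_mid,
        # Halfway between the V3 onset and the V3 midpoint.
        "V3 ons./midp.": (v3_on + v3_mid) / 2.0,
        "V3 midp.": v3_mid,
    }


if __name__ == "__main__":
    # Hypothetical label times for one utterance.
    print(f2_measurement_times(0.0, 100.0, 150.0, 200.0, 260.0, 360.0))
```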




Figure 2. Exemplification of the line-up procedure between the acoustic labels and the EPG frames in the case of the
utterance [paʃəʃip] (speaker Ra). See Table 2 for measurement points in time 1 through 9.

D. Measurement of palatal contact

To achieve an accurate estimate of degree of 
tongue dorsum contact at the palatal regions, the 
measure of contact was the number of "on"
electrodes in the innermost three or four rows of
the artificial palate. The main reason for this
choice was that differences between the degree of
palatal contact for [i] vs. [a] and for [ʃ] vs. [t] are
more salient at the center of the prepalate,
mediopalate and postpalate, than at the sides of
these articulatory regions (see section III.A). This
criterion ensures that coarticulatory effects
correspond to the most distinctive measure of
contrast in degree of dorsopalatal contact, and are
maximally preserved during the production of the
surrounding phonemes. 

EPG data were gathered for rows 3, 4 and 5 for
speakers Re and Ra, and for rows 2, 3, 4 and 5 for
speaker Ba. This speaker-dependent contrast was
due to the fact that linguopalatal contact was
systematically more peripheral for speaker Ba
than for speakers Re and Ra (see section III.A). 
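
The contact measure amounts to counting "on" electrodes over a speaker-specific set of rows. The sketch below is illustrative: the frame representation (a dictionary from row number to a list of 0/1 states) is hypothetical, and rows 3, 4 and 5 are assumed for speakers Re and Ra since the palate has only five rows and section III.A refers to rows 3, 4 and 5.

```python
# Rows entering the palatal contact measure for each speaker.
# Assumption: rows 3-5 for Re and Ra; rows 2-5 for Ba.
ROWS_BY_SPEAKER = {
    "Re": (3, 4, 5),
    "Ra": (3, 4, 5),
    "Ba": (2, 3, 4, 5),
}


def palatal_contact(frame, speaker):
    """Number of "on" electrodes in the rows selected for this speaker.

    frame: dict mapping row number (1 to 5) to a list of 0/1 electrode states.
    """
    return sum(sum(frame[row]) for row in ROWS_BY_SPEAKER[speaker])


if __name__ == "__main__":
    # Hypothetical frame; real rows hold more electrodes than shown here.
    frame = {1: [1, 1, 1, 1], 2: [1, 0, 1], 3: [1, 1, 0], 4: [0, 0, 1], 5: [1]}
    print(palatal_contact(frame, "Re"), palatal_contact(frame, "Ba"))
```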

III. RESULTS 

A. Linguopalatal contact and 
coarticulatory resistance for 
consonants and vowels 

1. Consonants 

Figure 3 shows patterns of linguopalatal contact 
averaged across repetitions at the midpoint of 
C2=[ʃ] and C2=[t] for all speakers. Linguopalatal
configurations in the figure correspond to the 
stressed CV syllables in sequences 1, 4, 13 and 16 
of Table 1. 

In Figure 3, tongue contact takes place between
the contour lines and the sides of the palate; the 
central area was left untouched by the tongue. 
Contour lines connect averages of linguopalatal 
contact between rows. The electrodes on or behind 
the contour line along a given row have been "on"
in 100% of the repetitions. The degree of
fronting for the contour line along the space 
between two adjacent electrodes on a given row is 
proportional to the contact average for the 
frontmost of the two electrodes; therefore the line 
lies closer to the frontmost electrode as the contact 
average for that electrode increases. Thus, in the 
case of the contour line for [ʃi] (speaker Ra),
electrodes 1 through 6 on row 1 of the left side of
the palate were lit up in 100% of occurrences (i.e.,
in all repetitions) and electrode 7 in 30% of
occurrences (i.e., in three out of ten repetitions). 
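
The contact averages underlying the contour lines can be computed electrode by electrode across repetitions. The following sketch is illustrative only; the data layout (one list of 0/1 states per repetition for a given row and side of the palate) and the function name are assumptions.

```python
def contact_average(repetitions):
    """Percentage of repetitions in which each electrode was "on".

    repetitions: list of equal-length lists of 0/1 states, one list per
    repetition, all describing the same row at the same measurement point.
    """
    count = len(repetitions)
    if count == 0:
        raise ValueError("at least one repetition is required")
    return [100.0 * sum(states) / count for states in zip(*repetitions)]


if __name__ == "__main__":
    # Ten repetitions: electrodes 1-6 always "on", electrode 7 "on" in three.
    reps = [[1] * 6 + [1 if r < 3 else 0] for r in range(10)]
    print(contact_average(reps))
```

With these hypothetical data the first six electrodes average 100% and the seventh 30%, matching the example given for speaker Ra.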

The patterns of contact for [ʃ] show that the
consonant is produced with a central groove along
the postalveolar and palatal regions while there is
a wide contact area at both sides of the palate. As
for [t], it is produced with a complete closure at
the front of the alveolar region (all speakers) and
at the postalveolar region (speaker Ra); overall,
the degree of lateral contact is less than for [ʃ].

Figure 4 is another display of the same EPG
frames row by row. The bars in the figure stand
for the overall number of "on" electrodes on each
row. As to the issue of concern here, namely, the
degree of tongue dorsum contact at the center of
the palatal regions (on rows 3, 4 and 5; see Figure
1), a general trend is observed for [ʃ] to show more
contact than [t]. However, while this is true before
[a], it is not so much the case before [i]. Thus, the
claim that [ʃ] shows more dorsal contact than [t] is
always true in the adjacency of low vowels but not
in the adjacency of high vowels. Moreover, as also
shown in Figure 3, speaker Ba shows less central
contact at the palatal regions than the other two
speakers; thus, no contact at all was observed for
[t] on rows 3, 4 and 5. For that reason, the EPG
data for this speaker were obtained by computing
the number of "on" electrodes on rows 2, 3, 4 and 5
(see section II.D) at the expense of including a few
electrodes located at the postalveolar region on
row 2.

The EPG data in Figure 3 confirm the
hypothesis that [ʃ] is less sensitive than [t] to
coarticulatory effects in tongue dorsum activity
because it involves more palatal contact. Thus,
while the degree of contact at the palatal regions
is highly similar for [ʃa] and [ʃi] (i.e., little or no
tongue dorsum lowering occurs for [ʃ] before [a]),
[ti] shows more tongue dorsum contact towards
the median line than [ta] (i.e., during the
production of [t], the tongue dorsum allows some
vertical displacement as a function of the adjacent
vowel). This trend is highly consistent for all three
speakers. It should be pointed out that manner of
articulation requirements may contribute to the
absence of coarticulatory effects in tongue dorsum
activity for [ʃ]; as shown in the literature
(McCutcheon et al., 1980; Wolf et al., 1976), the
formation of the medial groove for fricatives
involves a high degree of articulatory precision.

In summary, adjacent vowels cause more
variability in tongue dorsum contact for [t] than
for [ʃ], in accordance with the articulatory
characteristics of these consonants.









Figure 3. Linguopalatal patterns at the midpoint of C2=[ʃ] and C2=[t] as a function of following [i] and [a] for speakers
Re, Ra and Ba.

Figure 4. Number of "on" electrodes row by row at the midpoint of C2=[ʃ] and C2=[t] as a function of following [i] vs.
[a] for speakers Ra and Ba.

2. Vowels

Figure 5 shows patterns of linguopalatal contact 
averaged across repetitions at the midpoint of 
V3=[i] and V3=[a] for the three speakers. As for [ʃ]
and [t] (Figure 3), only data for the stressed CV
syllables in the symmetrical sequences are given.

Linguopalatal patterns show a larger area of 
contact at the center of the palatal regions and 
more tongue fronting for [i] than for [a]. The 
absence of linguopalatal contact for [a] in the case 
of speakers Ba and Ra shows that the vowel is 
lower in their New York City dialect than in 
Catalan. 








Figure 5. Linguopalatal patterns at the midpoint of V3=[i]
and V3=[a] as a function of preceding C2=[ʃ] vs. [t] for
speakers Re, Ra and Ba.

Phonetic variability for [i] vs. [a] as a function of 
the immediately preceding consonant is analyzed 
in Figure 6. For each of the pairs of sequences 
indicated in the figure, significant differences in 
degree of palatal contact and F2 frequency at the 
p < 0.01 level of significance are given. Effects are 
plotted at two points in time, namely, at V3 onset 
and V3 midpoint (EPG data), and at V3 
onset/midpoint and V3 midpoint (F2 data). 



Figure 6. EPG data and F2 data on significant
coarticulatory effects upon V3=[i] and [a] from preceding
C2=[ʃ] vs. [t] at the p < 0.01 level of significance
(speakers Re, Ra and Ba). Effects are given at the V3
onset and the V3 midpoint (EPG data), and at the V3
onset/midpoint and the V3 midpoint (F2 data). See Table
2 for measurement points in time.

The figure shows much larger consonant- 
dependent coarticulatory effects at the onset (EPG 
data) and the onset/midpoint (F2 data) of [a] than 
at the midpoint of [a] and along [i]. In the case of 
[a], these findings are due partly to the fact that 
the vowel onset is closer to the consonant than the 
vowel midpoint but also to the absence of 
linguopalatal contact at the midpoint of the vowel 
for speakers Ra and Ba. That there are no effects 
along V3=[i] for speakers Ra and Ba suggests that
this vowel is more resistant to coarticulation than
[a] because it is produced with much more palatal
contact.

In summary, [i] shows a larger degree of palatal 
contact and is more resistant to coarticulatory 
effects than [a]. Data on vertical displacement for 
the tongue body during the production of [i] vs. [a] 
(Perkell & Cohen, 1987) are consistent with the 
findings reported here.

B. Long range coarticulatory effects

1. Vowel effects

1a. Anticipatory coarticulation

Figure 7 displays EPG and F2 data on 
coarticulatory effects caused by V3 along the
preceding VCVC string for all speakers. The
sequences in the figure were ordered for
decreasing degrees of resistance to V3-dependent
anticipatory effects, as explained in section II.A;
thus, it was expected that these coarticulatory
effects would decrease from [iʃəʃi/a] through
[atəti/a] as a function of the degree of articulatory
constraint associated with the VCVC string. 

The EPG data show that the extent of
anticipatory coarticulation is clearly dependent on
the degree of palatal contact for C2 in the case of
speakers Re and Ra. Overall, when C2=[ʃ],
anticipatory effects may be absent or may start
during C2; on the other hand, when C2=[t], V3
anticipation goes back to V2=[ə]. Therefore, two
essential modes of anticipatory coarticulation
appear to take place for these two speakers: later
than V2=[ə] when C2=[ʃ]; at V2=[ə] when C2=[t].
These coarticulatory trends are very robust and
show that the anticipatory effects are conditioned
by the degree of articulatory constraint involved
during the production of the preceding phonetic
segment. No V3-dependent effects are available
before V2=[ə]. Speaker Ba shows no effects for any
of the sequences under analysis.

The F2 data are consistent to a large extent
with the EPG data. For speakers Re and Ra, V3-dependent
anticipatory effects take place during
V2=[ə] when C2=[ʃ] and [t]; however, at least for
speaker Re, C2=[ʃ] (but not C2=[t]) blocks V3-to-V2
effects in some cases. V3-dependent
coarticulatory effects do not extend into V1.
Speaker Ba shows almost no V3-to-V2 effects.
There is no one-to-one correspondence between
the EPG and the F2 data in this and in the next
figures since F2 frequency is affected by changes
in other articulatory dimensions besides
linguopalatal contact.

p < 0.01 level of significance (speakers Re, Ra and Ba). See Table 2 for measurement points in time.

The anticipation of the V3 gesture is not affected
by the degree of coarticulatory resistance
associated with the segments located in the VC
string before V2=[ə]. Thus, the V3-to-V2 effects
across C2=[t] take place no later when C1=[ʃ] than
when C1=[t]. In fact, contrary to initial
expectations, the EPG data for speakers Re and
Ra show an earlier onset of anticipatory
coarticulation across C2=[t] when V1=[i] than
when V1=[a].

In summary, for two of the speakers, the V3-to-V2
anticipatory effects are inversely dependent on
the degree of tongue dorsum contact for C2;
therefore, coarticulation takes place across C2=[t]
but not across C2=[ʃ]. Moreover, effects are
independent of the degree of articulatory
constraint for the phonetic segments preceding [ə].
No coarticulatory effects at a distance (i.e.,
exceeding the V3-to-V2 domain) were found; thus,
the EPG and the F2 data show neither V3-to-C1
nor V3-to-V1 effects.

1b. Carryover coarticulation

Figure 8 displays EPG and F2 data on 
coarticulatory effects caused by V1 along the
following CVCV string for all speakers. The
sequences in the figure have been ordered for
decreasing degrees of resistance to carryover
coarticulation associated with V1, as explained in
section II.A.

The EPG data for speakers Re and Ra show
clear similarities with the EPG data on
anticipatory effects displayed in Figure 7. Overall,
C1=[ʃ] allows lesser effects over time than C1=[t].
This pattern is not so clear for speaker Ba. Data
across speakers show that the offset time of the
V1-dependent carryover effects is not as
constrained as the onset time of the V3-dependent
anticipatory effects; while V1-dependent effects
across C1=[ʃ] may extend into [ə] (and even
further), those across C1=[t] may stop at C1. Also,
in contrast with anticipatory effects in Figure 7,
long range temporal effects are available; thus,
V1-dependent carryover effects may last until C2
(mainly when C1=[t], as expected) and,
occasionally, until V3.

The F2 data reveal a general trend for carryover
coarticulatory effects to extend into [ə], more so
when C1=[t] than when C1=[ʃ] for speakers Re
and Ra but not for speaker Ba. No V1-to-V3 effects
were found except in two instances for speaker Ra.
The fact that these two cases occur in contextual
environments which were judged to be highly
resistant to carryover coarticulation (i.e.,
sequences [i/aʃəʃi] and [i/aʃəʃa]) suggests that
additional explanations are needed.

A better account of the carryover effects
reported in Figure 8 results from a more detailed
analysis of the articulatory constraints involved
during the production of the VCVCV sequences
with C1=[ʃ]. The figure shows a consistent trend
for the utterances [i/aʃəti] and [i/aʃəta] to allow
lesser coarticulation than the utterances [i/aʃəʃi]
and [i/aʃəʃa]; as pointed out above, this is also the
case for the F2 data with regard to speaker Re.
The pattern of carryover effects for the utterances
[i/aʃəti] and [i/aʃəta] conforms to the pattern of V3
anticipation across C2=[ʃ] (Figure 7) in that
coarticulation does not reach V2=[ə]. The fact that
this pattern does not apply entirely to the
sequences [i/aʃəʃi] and [i/aʃəʃa], and that moreover
V1-dependent carryover effects may last longer
than for the sequences with C1=[t], suggests the
existence of a particular production strategy.

To investigate this issue I plotted the amount of
dorsopalatal contact over time for the sequences
[VʃəʃV] vs. the sequences [VʃətV] (speakers Ra
and Ba) in Figure 9. As shown in the figure, V1-dependent
effects from [i] vs. [a] take place during
V2=[ə] when C2=[ʃ] but not when C2=[t]. For the
sequences [VʃətV], an active production
mechanism is required for the achievement of
V2=[ə]. As compared to adjacent C1=[ʃ] and
C2=[t], the production of this vowel involves a
noticeable decrease in degree of dorsopalatal
contact; moreover, the linguopalatal target for [ə]
is highly independent of differences in the quality
of V1. The production of the sequences [VʃəʃV] is
characterized by a different strategy. No
articulatory target for the schwa is available in
this case; instead, tongue dorsum contact proceeds
gradually from C1 to C2 through the vowel. The
degree of linguopalatal contact stays high through
the entire utterance for the [iʃəʃV] sequences, and
increases from V1 to C2 in the [aʃəʃV] sequences.
It can be concluded that the articulatory
characteristics of C2 have a clear effect on the V1-dependent
coarticulatory trends in VCVCV
sequences with C1=[ʃ].

The EPG data in Figure 8 reveal that the C2V3
string may affect the V1-dependent carryover
effects in other instances. In accordance with the
initial hypothesis, the sequences [VtəʃV] show smaller
effects (mostly until C1 or V2) than the sequences
[VtətV] (mostly until V2 or C2), clearly so for
speaker Ra and, less so, for speakers Re and Ba.
Figure 10 illustrates this point with data on the
degree of dorsopalatal contact over time for these
sequences according to speaker Ra. V1-dependent
effects stop at the onset of V2=[ə] when C2=[ʃ],
but extend across [ə] into C2 when C2=[t].
Therefore, a highly constrained C2 (i.e., [ʃ]) blocks
V1-dependent coarticulatory effects and a C2
which is unspecified for tongue dorsum contact
allows these effects to take place over a long period
of time.

In summary, as for anticipatory effects,
carryover effects from V1 appear to be largely
dependent on the degree of palatal contact for the
adjacent phonetic segment (i.e., C1). Unlike the
onset of anticipatory effects, the offset of carryover
coarticulation is less constrained in time, may
extend beyond V2=[ə], and is dependent on the
articulatory characteristics of the phonemic string
on the other side of [ə].

2. Consonantal effects 

2a. Anticipatory coarticulation 

Figure 11 displays EPG and F2 data on
coarticulatory effects caused by C2 along the
preceding VCV string for all speakers. The
sequences have been ordered for decreasing
degrees of resistance to anticipatory effects
associated with C2, as explained in section II.A.

According to the figure, C2-dependent
anticipatory effects over time for speaker Ra occur
more frequently for sequences ending in V3=[a]
than for those ending in V3=[i].

Figure 9. V1-dependent carryover effects in tongue dorsum contact over time for the sequences [VʃəʃV] vs. [VʃətV]
according to speakers Ra and Ba. See Table 2 for measurement points in time.

Figure 10. V1-dependent carryover effects in tongue dorsum contact over time for the sequences [VtəʃV] vs. [VtətV]
according to speaker Ra. See Table 2 for measurement points in time.

The figure also shows that the EPG data for
C2=[ʃ] and [t] before V3=[i] for the same speaker
are not significantly different at the C2 midpoint
(i.e., point in time 7). Anticipatory effects from
C2=[ʃ] vs. [t] occur mainly when V3=[a] since
there are clear differences in degree of tongue
dorsum raising for the consonant in this context
(see Figure 4); consonant-dependent anticipatory
effects are small or nonexistent when V3=[i] since
this vowel causes the dorsum of the tongue for [t]
to show a similar amount of raising to that for [ʃ]
(see Figure 4). This finding shows that the
availability and extent of anticipatory
coarticulation are dependent on the particular state
of the articulators during the production of the
target phoneme.

Overall, for all three speakers, a strong
tendency is observed for C2-dependent effects to
begin as early as possible during V2=[ə] but rarely
during the phonetic segments preceding V2=[ə].
Effects do not appear to be dependent on the
articulatory characteristics of C1; thus, for
example, the onset time of the C2-dependent
anticipatory effects does not occur later when
C1=[ʃ] than when C1=[t]. Data for speaker Ba look
somewhat different from data for speakers Re and
Ra in this respect; in this case, more facilitation of
the C2-dependent anticipatory effects may take
place when V1=[a] than when V1=[i].

The F2 data show no C2-dependent effects
before V2=[ə]. In general, those utterances
showing little coarticulation in linguopalatal
contact allow no C2-to-[ə] effects in F2 frequency.

In summary, as for V2-dependent anticipatory
effects, C2-dependent anticipatory effects are
constrained to occur at the onset of V2=[ə], more
so for speakers Re and Ra than for speaker Ba.
Overall, they also are highly independent of the
articulatory characteristics of distant segments in
the string.

2b. Carryover coarticulation 

Figure 12 displays EPG and F2 data on
coarticulatory effects caused by C1 along the
following VCV string for all speakers. The
sequences in the figure have been ordered for
decreasing degrees of resistance to carryover
effects associated with C1, as explained in section

Analogously to the data in Figure 11, the EPG
data at the C1 midpoint show significant
differences between [ʃ] and [t] after V1=[a] but not
after V1=[i], for speakers Ra and Ba and, to a
lesser extent, for speaker Re. As expected, the
most reliable C1-dependent effects over time occur
when V1=[a].

Figure 11. EPG and F2 data on significant C2-dependent anticipatory effects along the entire VCVCV utterance at the
p < .01 level of significance (speakers Re, Ra and Ba). See Table 2 for measurement points in time.

Figure 12. EPG and F2 data on significant C1-dependent carryover effects along the entire VCVCV utterance at the
p < .01 level of significance (speakers Re, Ra and Ba). See Table 2 for measurement points in time.

According to the figure, the offset time of the
carryover effects appears to be affected by the
articulatory characteristics of C2; thus, the EPG
data (all speakers) and the F2 data (speakers Re
and Ra) show a general tendency for
coarticulation to last longer in the sequences
[aCətV] than in the sequences [aCəʃV]. Long
range effects from C1 are more salient than those
from C2 in Figure 11; accordingly, the EPG and
the F2 data in Figure 12 show instances of
C1-to-C2 and C1-to-V3 coarticulation, mostly when
C2=[t] and V3=[a], as expected.

In summary, as with vowel-dependent
carryover vs. anticipatory effects,
consonant-dependent carryover effects are more variable,
may last longer, and are more affected by distant
phonetic segments than consonant-dependent
anticipatory effects.

IV. DISCUSSION AND CONCLUSIONS 

It was found in this paper that the amount of
coarticulation in tongue dorsum contact varies
inversely to the degree of palatal contact in
adjacent or intervening segments. Data reported
in the Results section (III.A.1 and III.A.2) show
larger variability in tongue-dorsum contact and in
F2 frequency for [t] vs. [ʃ] as a function of vocalic
context, and for [i] vs. [a] as a function of
consonantal context.

Data on long range coarticulatory effects show
that the onset of V- and C-dependent anticipatory
coarticulation is quite fixed in time, is conditioned
by the degree of tongue dorsum contact for the
immediately preceding segment in the string (i.e.,
C2), and is speaker-dependent. Thus, the onset of
the V3-dependent effects occurs during C2=[ʃ], but
during preceding V2=[ə] if C2=[t] (two speakers);
for these same speakers the onset of the
consonant-dependent effects takes place at the
onset of V2=[ə]. Speaker Ba, on the other hand,
shows no V3-dependent effects, and C2- 
dependent effects originating at the onset of bl 

As compared with anticipatory effects, long
range carryover effects from V1 and C1 were
found to be considerably more variable (i.e., offset
times were not as regular as onset times for
anticipatory effects) and to extend further in time
(onto C2 and V3) when distant context so allowed.
Moreover, carryover effects were more sensitive than
anticipatory effects to the degree of articulatory
constraint for the contextual gestures. Like
anticipatory coarticulation effects, they were
conditioned by the degree of tongue dorsum
contact for the adjacent segment (e.g., offset of
vowel-dependent coarticulation occurs earlier
when C1=[ʃ] than when C1=[t]). However,
in contrast to anticipatory effects, carryover
coarticulation was found to be counteracted or
facilitated by distant segments; thus, the V1-dependent
effects last longer in sequences [VtətV] vs. [VtəʃV]
and [VʃəʃV] vs. [VʃətV], and the offset of the
C1-dependent effects takes place earlier when C2=[ʃ]
than when C2=[t].

These data reveal that anticipatory and
carryover effects are highly sensitive to the degree
of dorsopalatal constriction exhibited by the
adjacent phonetic segments in the string.
Moreover, they suggest that, to a large extent, the
degree of spreading of gestural activity over time
may depend on some measure of degree of
gestural antagonism in the adjacent and/or
distant phonetic segments; it may be that the
temporal domain of a gesture increases or
decreases as the contextual gestures become more
or less neutral, respectively, with respect to its
production properties. Research reported in this
study and in the recent literature (see
Introduction section) provides good evidence for
the notion of degree of gestural antagonism in the
case of tongue dorsum raising towards the palate.
Thus, the onset of gestural activity appears to
take place earlier as the degree of gestural conflict
for the preceding consonants decreases, for
[ʃ]>[t]>[p]. Also, V1-to-V3 or V3-to-V1 effects in
VCVCV sequences have been reported to occur
only when V2 is highly neutral with respect to
tongue body coarticulation (i.e., [ə]); consistently
with this view, V-to-V effects across labial and
dentoalveolar consonants in VCV sequences may
not take place when the fixed vowel is not [a] (see
section I.C for references).

This issue of articulatory antagonism is crucial
to elucidate the temporal domain of the
coarticulatory effects (and of phonemic gestures in
general), and the relative dominance of
anticipatory vs. carryover coarticulation. In my
view these two aspects are related as follows: the
more the contextual gestures approach absolute
neutrality with respect to the target gesture, the
more anticipatory effects prevail over carryover
effects. This situation takes place when large
amounts of undershoot are available (e.g., in fast
speech) and/or when preceding segments can
adapt easily to the target gesture (e.g., to tongue
positioning for an upcoming vowel during the
production of one or more labial consonants). In
such extreme circumstances anticipatory
coarticulation may extend more than two
segments in advance; thus, as mentioned in the
Introduction section, Magen (1989) found
V3-to-V1 effects in [VbəbV] sequences. Otherwise, as for
/t/ in the present study (also for /V in Huffman,
1986) and at slow speech rates (probably in the
case of speaker Ba, whose speech was the slowest
of all speakers analyzed in this study), carryover
effects extend over larger periods of time than
anticipatory effects. Thus, in [VtətV] sequences,
V- and C-dependent carryover effects were found
to extend up to three (and to a lesser extent four)
segments from the target and to block possible
V- and C-dependent anticipatory effects before V2. I
suggest that this is the case because [t] is not
completely neutral with respect to tongue dorsum
activity; indeed, the raising of the front of the
tongue also involves some raising of the tongue
dorsum.

While differences in the temporal extent of
coarticulatory effects differing in directionality
may be shown to depend on the degree of
articulatory constraint for the adjacent gestures, a
different account is needed to explain why the
onset of anticipatory coarticulation is more fixed
than the offset of carryover coarticulation. Thus,
for example, no requirements on articulatory
control can be invoked to explain why
carryover effects are more dependent than
anticipatory effects upon the articulatory
characteristics of the distant phonemes in the
string. Again, even though the articulatory
configuration of V2=[ə] in the utterances of the
present study is roughly equally affected by C1
and C2, the extent of V1-dependent coarticulation
is affected by the articulatory properties of C2, but
the extent of V3-dependent coarticulation is not
affected by the articulatory properties of C1. An
obvious interpretation of this finding is that, while
carryover effects may adapt to the articulatory
requirements imposed upon the realization of
distant phonetic segments, anticipatory effects do
not do so. Instead, the onset of the anticipatory
effects is required to occur at a specific moment in
time or during the production of a given preceding
phoneme. Together with data from other sources
(Magen, 1989; Parush et al., 1983; Recasens,
1987) this may be the sort of regularity that we
are looking for to state that anticipatory effects
reflect motor planning to a larger extent than
carryover effects.

The findings reported in this paper provide
some support for a time-locked model of
anticipatory coarticulation. Thus, the onset of
V3-dependent effects across non-antagonistic C2=[t]
was always found to occur at V2=[ə], and the onset
of C2-dependent effects across non-antagonistic
V2=[ə] was mostly detected at the onset of V2=[ə].
Clearly, more work is needed to determine how
the onset time of articulatory activity for a given
gesture changes as a function of the degree of
gestural antagonism associated with the
preceding phonemes in the string.

Some data reported in this and other
coarticulation studies are relevant to Öhman's
V-to-V model of coarticulation. Firstly, V-to-V effects
barely take place if the intervening consonant is
highly constrained and/or if the fixed
transconsonantal vowel is not highly neutral with
respect to the target vowel gesture (Recasens,
1987). Secondly, contrasting speaker-dependent
behaviors may also be available; for example, it
was found in the present study that V3-dependent
anticipatory effects for speaker Ba did not reach
the immediately preceding consonant (i.e., C2).
Thirdly, as stated earlier, the temporal domain of
anticipatory coarticulation may exceed two
segments, if the contextual gestures are highly
neutral with respect to the target gesture;
moreover, the particular nature of the carryover
effects also facilitates long range effects beyond
V2=[ə]. Finally, as shown in Figures 8 and 12 of
this paper, V- and C-dependent carryover effects
may land on a consonant (i.e., C2) showing a low
degree of articulatory constraint. In agreement
with the hypothesis of a V-to-V mode of
production, in most cases V-dependent effects
were found to exceed C-dependent effects. Thus,
for two speakers, no C2-to-C1 anticipatory effects
(even when C1=[t]) were available but only
V3-to-V2 effects; of course, it can be claimed that [ə] is
more neutral than [t] with regard to coarticulation
in tongue dorsum activity, and that C2-to-C1
effects might have been found had C1 been [p]
instead of [t]. For speaker Ba, however, while
C2-to-[ə] effects occur systematically, no
V3-dependent effects were found either upon V2=[ə]
or even upon C2=[t]. It can be concluded that the
availability of a V-to-V mode of production may
vary in view of the strong and subtle dependence
of coarticulation on gestural antagonism in the
contextual phonemes.
A Dynamical Approach to Gestural Patterning in Speech
Production*

Elliot L. Saltzman and Kevin G. Munhall†



In this article, we attempt to reconcile the linguistic hypothesis that speech involves an
underlying sequencing of abstract, discrete, context-independent units, with the empirical
observation of continuous, context-dependent interleaving of articulatory movements. To
this end, we first review a previously proposed task-dynamic model for the coordination
and control of the speech articulators. We then describe an extension of this model in
which invariant speech units (gestural primitives) are identified with context-independent
sets of parameters in a dynamical system having two functionally distinct but interacting
levels. The intergestural level is defined according to a set of activation coordinates; the
interarticulator level is defined according to both model articulator and tract-variable
coordinates. In the framework of this extended model, coproduction effects in speech are
described in terms of the blending dynamics defined among a set of temporally overlapping
active units; the relative timing of speech gestures is formulated in terms of the serial
dynamics that shape the temporal patterning of onsets and offsets in unit activations.
Implications of this approach for certain phonological issues are discussed, and a range of
relevant experimental data on speech and limb motor control is reviewed.



INTRODUCTION
The production of speech is portrayed
traditionally as a combinatorial process that uses
a limited set of units to produce a very large
number of linguistically "well-formed" utterances
(e.g., Chomsky & Halle, 1968). For example, /mæd/
and /dæm/ are characterized by different
underlying sequences of the hypothesized
segmental units /m/, /d/, and /æ/. These types of
speech units are usually seen as discrete, static,
and invariant across a variety of contexts.
Putatively, such characteristics allow speech
production to be generative, because units of this
kind can be concatenated easily in any order to
form new strings. The reality of articulation,
however, bears little resemblance to this
depiction. During speech production, the shape of
the vocal tract changes constantly over time.
These changes in shape are produced by the
movements of a number of relatively independent
articulators (e.g., velum, tongue, lips, jaw, etc.).
For example, Figure 1 (from Krakow, 1987)
displays the vertical movements of the lower lip,
jaw and velum for the utterance "it's a /bamib/
sid." The lower lip and jaw cooperate to
alternately close and open the mouth during
/bamib/ while, simultaneously, the velum
alternates between a closed and open posture. It is
clear from this figure that the articulatory
patterns do not take the form of discrete, abutting
units that are concatenated like beads on a string.
Rather, the movements of different articulators
are interleaved into a continuous gestural flow.
Note, for example, that velic lowering for the /m/
begins even before the lip and jaw complete the
bilabial opening from the /b/ to the /a/.

[Figure 1 panels, top to bottom: ACOUSTICS, VELUM, LOWER LIP, JAW; utterance segment /bamib/]

Figure 1. Acoustic waveform and optoelectronically monitored vertical components of the articulatory trajectories
accompanying the utterance "it's a /bamib/ sid." (From Krakow, 1987; used with author's permission).



In this article, we focus on the patterning of
speech gestures, drawing on recent developments
in experimental phonology/phonetics and in the
study of coordinated behavior patterns in
multi-degree-of-freedom dynamical systems. Our key
questions are the following: How can one best
reconcile traditional linguistic analyses (discrete,
context-independent units) with experimental
observations of speech articulation and acoustics
(continuous, context-dependent flows)? How can
one reconcile the hypothesis of underlying
invariance with the reality of surface variability?
We try to answer these questions by detailing a
specific dynamical model of articulation. Our focus
on dynamical systems derives from the fact that
such systems offer a theoretically unified account
of: a) the kinematic forms or patterns displayed by
the articulators during speech; b) the stability of
these forms to external perturbations; and c) the
lawful warping of these forms due to changing
system constraints such as speaking rate,
casualness, segmental composition, or
suprasegmental stress. For us the primary
importance of the work lies not so much in the
details of this model, but in the problems that can
be delineated within its framework. It has
become clear that a complete answer to these
questions will have to address (at least) the
following: 1) the nature of the gestural units or
primitives themselves; 2) the articulatory
consequences of partial or total temporal overlap
(coproduction) in the activities of these units that
results from gestural interleaving; and 3) the
serial coupling among gestural primitives, i.e., the
processes that govern intergestural relative
timing and that provide intergestural cohesion for
higher-order, multigesture units.

Our central thesis is that the spatiotemporal
patterns of speech emerge as behaviors implicit in
a dynamical system with two functionally distinct
but interacting levels. The intergestural level is
defined according to a set of activation
coordinates; the interarticulator level is defined
according to both model articulator and
tract-variable coordinates (see Figure 2). Invariant
gestural units are posited in the form of relations
between particular subsets of these coordinates
and sets of context-independent dynamical
parameters (e.g., target position and stiffness).
Contextually-conditioned variability across
different utterances results from the manner in
which the influences of gestural units associated
with these utterances are gated and blended into
ongoing processes of articulatory control and
coordination. The activation coordinate of each
unit can be interpreted as the strength with which
the associated gesture "attempts" to shape vocal
tract movements at any given point in time. The
tract-variable and model articulator coordinates of
each unit specify the particular vocal-tract
constriction (e.g., bilabial) and set of articulators
(e.g., lips and jaw) whose behaviors are directly
affected by the associated unit's activation. The
intergestural level accounts for patterns of
relative timing and cohesion among the activation
intervals of gestural units that participate in a
given utterance, e.g., the activation intervals for
tongue-dorsum and bilabial gestures in a
vowel-bilabial-vowel sequence. The interarticulator level
accounts for the coordination among articulators
evident at a given point in time due to the
currently active set of gestures, e.g., the
coordination among lips, jaw, and tongue during
periods of vocalic and bilabial gestural
coproduction.





[Figure 2 levels: Intergestural Coordination (gestural activation variables); Interarticulatory Coordination (tract variables; model articulatory variables)]

Figure 2. Schematic illustration of the proposed two-level
dynamical model for speech production, with associated
coordinate systems indicated. The darker arrow from the
intergestural to the interarticulator level denotes the
feedforward flow of gestural activation. The lighter
arrow indicates feedback of ongoing tract-variable and
model articulator state information to the intergestural
level.

In the following pages we take a stepwise
approach to elaborating upon these ideas. First,
we examine the hypothesis that the formation and
release of local constrictions in vocal tract shape
are governed by active gestural units that serve to
organize the articulators temporarily and flexibly
into functional groups or ensembles of joints and
muscles (i.e., synergies) that can accomplish
particular gestural goals. Second, we review a
recent, promising extension of this approach to the
related phenomena of coarticulation and
coproduction (Saltzman, Rubin, Goldstein, &
Browman, 1987). Third, we describe some recent
work in a connectionist, computational framework
(e.g., Grossberg, 1986; Jordan, 1986, in press;
Lapedes & Farber, cited in Lapedes & Farber,
1986; Rumelhart, Hinton, & Williams, 1986) that
offers a dynamical account of intergestural timing.
Fourth, we examine the issue of intergestural
cohesion and the relationships that may exist
between stable multiunit ensembles and the
traditional linguistic concept of phonological
segments. In doing so, we review the work of
Browman and Goldstein (1986) on their
articulatory phonology. Fifth, and finally, we
review the influences of factors such as speaking
rate and segmental composition on gestural
patterning, and speculate on the further
implications of our approach for understanding
the production of speech.

Gestural primitives for speech: 
A dynamical framework 

Much theoretical and empirical evidence from
the study of skilled movements of the limbs and
speech articulators supports the hypothesis that
the "significant informational units of action"
(Greene, 1971, p. xviii) do not entail rigid or
hard-wired control of joint and/or muscle variables.
Rather, these units or coordinative structures (e.g.,
Fowler, 1977; Kugler, Kelso & Turvey, 1980, 1982;
Kugler & Turvey, 1987; Saltzman, 1986; Saltzman
& Kelso, 1987; Turvey, 1977) must be defined
abstractly or functionally in a task-specific,
flexible manner. Coordinative structures have
been conceptualized within the theoretical and
empirical framework provided by the field of
(dissipative) nonlinear dynamics (e.g., Abraham &
Shaw, 1982, 1986; Guckenheimer & Holmes, 1983;
Haken, 1983; Thompson & Stewart, 1986;
Winfree, 1980). Specifically, it has been
hypothesized (e.g., Kugler et al., 1980; Saltzman &
Kelso, 1987; Turvey, 1977) that coordinative
structures be defined as task-specific and
autonomous (time-invariant) dynamical systems
that underlie an action's form as well as its
stability properties. These attributes of
task-specific flexibility, functional definition, and
time-invariant dynamics have been incorporated into a
task-dynamic model of coordinative structures
(Kelso, Saltzman & Tuller, 1986a, 1986b;
Saltzman, 1986; Saltzman & Kelso, 1987;
Saltzman et al., 1987). In the model,
time-invariant dynamical systems for specific skilled
actions are defined at an abstract (task space)
level of system description. These invariant
dynamics underlie and give rise to
contextually-dependent patterns of change in the dynamic
parameters at the articulatory level, and hence to
contextually-varying patterns of articulator
trajectories. Qualitative differences between the
stable kinematic forms required by different tasks
are captured by corresponding topological
distinctions among task-space attractors (see also
Arbib, 1984, for a related discussion of the relation
between task and controller structures). As
applied to limb control, for example, gestures
involving a hand's discrete motion to a single
spatial target and repetitive cyclic motion between
two such targets are characterized by
time-invariant point attractors (e.g., as with a damped
pendulum or damped mass-spring, whose motion
decays over time to a stable equilibrium point)
and periodic attractors (limit cycles; e.g., as with
an escapement-driven pendulum in a grandfather
clock, whose motion settles over time to a stable
oscillatory cycle), respectively.
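
The contrast between the two attractor types can be illustrated numerically. The following is a minimal sketch, not part of the task-dynamic model itself: the point attractor is a unit-mass damped mass-spring, and the limit cycle is a van der Pol oscillator, which is substituted here for the escapement-driven pendulum purely for convenience. All function names and parameter values are assumptions chosen for illustration.

```python
def simulate(accel, x0, v0, dt=0.001, steps=40000):
    """Integrate x'' = accel(x, v) with semi-implicit Euler steps.

    Returns the trajectory as a list of (position, velocity) pairs.
    """
    x, v = x0, v0
    traj = []
    for _ in range(steps):
        v += accel(x, v) * dt   # update velocity from current state
        x += v * dt             # update position from the new velocity
        traj.append((x, v))
    return traj


def point_attractor(target=1.0, k=100.0, b=20.0):
    """Unit-mass damped mass-spring; b = 2*sqrt(k) gives critical damping."""
    return lambda x, v: -b * v - k * (x - target)


def limit_cycle(mu=1.0):
    """Van der Pol oscillator, used here as a stand-in periodic attractor."""
    return lambda x, v: mu * (1.0 - x * x) * v - x


def final_amplitude(traj, tail=10000):
    """Half the peak-to-peak excursion of position over the last `tail` steps."""
    xs = [x for x, _ in traj[-tail:]]
    return (max(xs) - min(xs)) / 2.0


if __name__ == "__main__":
    for x0 in (0.0, 3.0):
        pa = simulate(point_attractor(), x0=x0, v0=0.0)
        lc = simulate(limit_cycle(), x0=x0 + 0.1, v0=0.0)
        print("start", x0,
              "| point attractor ends at", round(pa[-1][0], 4),
              "| limit cycle amplitude", round(final_amplitude(lc), 2))
```

Under these assumptions, trajectories started from different initial conditions settle to the same equilibrium position in the first system and to the same oscillation amplitude in the second, which is the sense in which each attractor is independent of the starting state.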

Model articulator and tract variable 
coordinates 

In speech, a major task for the articulators is to
create and release constrictions locally in different
regions of the vocal tract, e.g., at the lips for
bilabial consonants, or between the tongue
dorsum and palate for some vowels. In
task-dynamics, constrictions in the vocal tract are
governed by a dynamical system defined at the
interarticulator level (Figure 2) according to both
tract variable (e.g., bilabial aperture) and model
articulator (e.g., lips and jaw) coordinates. Tract
variables are the coordinates in which
context-independent gestural "intents" are framed, and
model articulators are the coordinates in which
context-dependent gestural performances are
expressed. The distinction between tract-variables
and model articulators reflects a behavioral
distinction evident in speech production. For
example, in a vowel-bilabial-vowel sequence a
given degree of effective bilabial closure may be
achieved with a range of different lip-jaw
movements that reflects contextual differences in
the identities of the flanking vowels (e.g.,
Sussman, MacNeilage, & Hanson, 1973).

In task-dynamic simulations, each constriction
type (e.g., bilabial) is associated with a pair
(typically) of tract variables, one that refers to the
location of the constriction along the longitudinal
axis of the vocal tract, and one that refers to the
degree of constriction measured perpendicularly to
the longitudinal axis in the sagittal plane.
Furthermore, each gestural/constriction type is
associated with a particular subset of model
articulators. These simulations have been
implemented using the Haskins Laboratories
software articulatory synthesizer (Rubin, Baer &
Mermelstein, 1981). The synthesizer is based on a
midsagittal view of the vocal tract and a simplified
kinematic description of the vocal tract's
articulatory geometry. Modeling work has been
performed in cooperation with several of our
colleagues at Haskins Laboratories as part of an
ongoing project focused on the development of a
gesturally-based, computational model of
linguistic structures (Browman & Goldstein, 1986,
in press; Browman, Goldstein, Kelso, Rubin, &
Saltzman, 1984; Browman, Goldstein, Saltzman,
& Smith, 1986; Kelso et al., 1986a, 1986b; Kelso,
Vatikiotis-Bateson, Saltzman, & Kay, 1985;
Saltzman, 1986; Saltzman et al., 1987).

Figures 3 and 4 illustrate the tract variables
and articulatory degrees of freedom that are the
focus of this article. In the present model, they are
associated with the control of bilabial, tongue-dorsum,
and "lower-tooth-height" constrictions.
Bilabial gestures are specified according to the
tract variables of lip protrusion (LP; the
horizontal distance of the upper and lower lips to
the upper and lower front teeth, respectively) and
lip aperture (LA; the vertical distance between the
lips). For bilabial gestures the four modeled
articulatory components are: yoked horizontal
movements of the upper and lower lips (LH), jaw
angle (JA), and independent vertical motions of
the upper lip (ULV) and lower lip (LLV) relative to
the upper and lower front teeth, respectively.
Tongue-dorsum gestures are specified according to
the tract variables of tongue-dorsum constriction
location (TDCL) and constriction degree (TDCD).
These tract variables are defined as functions of
the current locations in head-centered coordinates
of the region of maximum constriction between
the tongue-body surface and the upper and back
walls of the vocal tract. The articulator set for
tongue-dorsum gestures has three components:
tongue body radial (TBR) and angular (TBA)
positions relative to the jaw's rotation axis, and
jaw angle (JA). Lower-tooth-height gestures are
specified according to a single tract variable
defined by the vertical position of the lower front
teeth, or equivalently, the vertical distance
between the upper and lower front teeth. Its
articulator set is simply jaw angle (JA). This tract
variable is not used in most current simulations,
but was included in the model to test hypotheses
concerning suprasegmental control of the jaw
(Macchi, 1985), the role of lower tooth height in
tongue blade fricatives, etc.
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Figure 3. Schematic midsagittal vocal tract outline, with tract-variable degrees of freedom indicated by arrows. (See text
for definitions of tract-variable abbreviations used.)
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Figure 4. Matrix representing the relationship between tract-variables (z) and model articulators (φ). The filled cells in a
given tract-variable row denote the model articulator components of that tract-variable's articulatory set. The empty
cells indicate that the corresponding articulators do not contribute to the tract-variable's motion. (See text for
definitions of abbreviations used in the figure.)
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Each gesture in a simulated utterance is 
associated with a corresponding tract-variable 
dynamical system. At present, all such dynamical
systems are defined as tract-variable point
attractors, i.e., each is modeled currently by a
damped, second-order linear differential equation
(analogous to a damped mass-spring). The
corresponding set of tract-variable motion
equations is described in Appendix 1. These
equations are used to specify a functionally
equivalent dynamical system expressed in the
model articulator coordinates of the Haskins
articulatory synthesizer. This model articulator
dynamical system is used to generate articulatory
motion patterns. It is derived by transforming the
tract-variable motion equations into an
articulatory space whose components have
geometric attributes (size, shape) but are
massless. In other words, this transformation is a
strictly kinematic one, and involves only the
substitution of variables defined in one coordinate
system for variables defined in another coordinate
system (see Appendix 2).
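
The point-attractor idea can be illustrated with a short numerical sketch. The Python code below integrates a single damped mass-spring equation, m*z'' + b*z' + k*(z - z0) = 0, for one tract variable. The function name, the parameter values, and the choice of critical damping as the default are illustrative assumptions; they are not taken from the model's actual parameter set.

```python
import math


def simulate_point_attractor(z_init, z_target, stiffness, damping=None,
                             mass=1.0, dt=0.001, duration=1.0):
    """Integrate m*z'' + b*z' + k*(z - z_target) = 0 from rest.

    Uses semi-implicit Euler steps and returns the list of positions.
    All numerical values are hypothetical.
    """
    if damping is None:
        # Assumed default: critical damping, b = 2*sqrt(m*k)
        damping = 2.0 * math.sqrt(mass * stiffness)
    z, v = z_init, 0.0
    trajectory = [z]
    for _ in range(int(round(duration / dt))):
        accel = (-damping * v - stiffness * (z - z_target)) / mass
        v += accel * dt          # update velocity first
        z += v * dt              # then position, using the new velocity
        trajectory.append(z)
    return trajectory


# Hypothetical lip-aperture closing movement: 10 mm open to 0 mm.
lip_aperture = simulate_point_attractor(z_init=10.0, z_target=0.0,
                                        stiffness=400.0)
print(lip_aperture[-1])
```

Whatever the starting position, the trajectory settles on the target, which is the defining property of a point attractor.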

Using the model articulator dynamical system
(Equation [A4] in Appendix 2) to simulate simple
utterances, the task-dynamic model has been able
to generate many important aspects of natural
articulation. For example, the model has been
used to reproduce experimental data on
compensatory articulation, whereby the speech
system quickly and automatically reorganizes
itself when faced with unexpected mechanical
perturbations (e.g., Abbs & Gracco, 1983; Folkins
& Abbs, 1975; Kelso, Tuller, Vatikiotis-Bateson, &
Fowler, 1984; Munhall & Kelso, 1985; Munhall,
Lofqvist, & Kelso, 1986; Shaiman & Abbs, 1987) or
with static mechanical alterations of vocal tract
shape (e.g., Gay, Lindblom, & Lubker, 1981;
MacNeilage, 1970). Such compensation for
mechanical disturbances is achieved by
readjusting activity over an entire subset of
articulators in a gesturally specific manner. The
task-dynamic model has been used to simulate the
compensatory articulation observed during
bilabial closure gestures (Saltzman, 1986; Kelso et
al., 1986a, 1986b). Using point-attractor (e.g.,
damped mass-spring) dynamics for the control of
lip aperture, when the simulated jaw is "frozen" in
place during the closing gesture, at least the main
qualitative features of the data are captured by
the model, in that: 1) the target bilabial closure is
reached (although with different final articulator
configurations) for both perturbed and
unperturbed "trials," and 2) compensation is
immediate in the upper and lower lips to the jaw
perturbation, i.e., the system does not require
reparameterization in order to compensate.
Significantly, in task-dynamic modeling the
processes governing intra-gestural motions of a
given set of articulators (e.g., the bilabial
articulatory set defined by the jaw and lips) are
exactly the same during simulations of both
unperturbed and mechanically perturbed active
gestures. In all cases, the articulatory movement
patterns emerge as implicit consequences of the
gesture-specific dynamical parameters (i.e., tract-variable
parameters and articulator weights; see
Appendices 1 and 2), and the ongoing postural
state (perturbed or not) of the articulators.
Explicit trajectory planning and/or replanning
procedures are not required.
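
A toy simulation may help to show why no replanning is needed. In the Python sketch below, a point attractor is defined on lip aperture and its driving influence is distributed over the jaw, upper lip, and lower lip through an unweighted pseudoinverse. The linear kinematics (lip aperture equals an open value minus the sum of three displacements), the equal weighting, the parameter values, and the function name `close_lips` are all simplifying assumptions made for illustration; this is not the Haskins implementation.

```python
import math


def close_lips(freeze_jaw_at=None, dt=0.001, duration=1.5):
    """Simulate a bilabial closing gesture, optionally freezing the jaw.

    Articulator index 0 = jaw, 1 = upper lip, 2 = lower lip. Each value is
    an aperture-reducing displacement, so that (by assumption)
    LA = la_open - (jaw + upper lip + lower lip).
    Returns the final lip aperture and the final articulator positions.
    """
    la_open, la_target = 12.0, 0.0             # hypothetical values (mm)
    stiffness = 400.0                          # unit mass assumed
    damping = 2.0 * math.sqrt(stiffness)       # critical damping
    jacobian = [-1.0, -1.0, -1.0]              # dLA/d(articulator)
    norm = sum(j * j for j in jacobian)
    pinv = [j / norm for j in jacobian]        # J^T (J J^T)^-1
    pos = [0.0, 0.0, 0.0]
    vel = [0.0, 0.0, 0.0]
    for step in range(int(round(duration / dt))):
        frozen = freeze_jaw_at is not None and step * dt >= freeze_jaw_at
        if frozen:
            vel[0] = 0.0                       # perturbation: jaw held in place
        # The tract-variable state is read from the actual articulator state.
        la = la_open + sum(j * p for j, p in zip(jacobian, pos))
        la_vel = sum(j * v for j, v in zip(jacobian, vel))
        la_acc = -damping * la_vel - stiffness * (la - la_target)
        for i in range(3):
            if frozen and i == 0:
                continue                       # the frozen jaw cannot move
            vel[i] += pinv[i] * la_acc * dt
            pos[i] += vel[i] * dt
    final_la = la_open + sum(j * p for j, p in zip(jacobian, pos))
    return final_la, pos


unperturbed_la, unperturbed_pos = close_lips()
perturbed_la, perturbed_pos = close_lips(freeze_jaw_at=0.05)
print(unperturbed_la, unperturbed_pos)
print(perturbed_la, perturbed_pos)
```

In both runs the same parameters are used. Because the lip-aperture error is computed from the current posture, the lips simply travel farther when the jaw is held, and closure is still reached with a different final configuration.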

Gestural activation coordinates

Task dynamics identifies several different time 
spans that are important for conceptualizing the 
dynamics of speech production. For example, the 
settling time of an unperturbed discrete bilabial 
gesture is the time required for the system to 
move from an initial position with zero velocity to 
within a criterion percentage (e.g., 2%) of the 
distance between initial and target positions. A 
gesture's settling time is determined jointly by the
inertia, stiffness, and damping parameters
intrinsic to the associated tract-variable point 
attractor. Thus, gestural duration or settling time 
is implicit in the dynamics of the interarticulator 
level (Figure 2, bottom), and is not represented 
explicitly. There is, however, another time span 
that is defined by the temporal interval during 
which a gestural unit actively shapes movements 
of the articulators. In previous sections, the 
concept of gestural activity was used in an 
intuitive manner only. We now define it in a more 
specific fashion. 
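
The notion of settling time can be made concrete numerically. The Python sketch below estimates, for a unit-distance movement that starts at rest, the time after which the position stays within a criterion fraction (2% here) of the target. The function name, the default damping ratio of 1.0, and the stiffness values are illustrative assumptions.

```python
import math


def settling_time(stiffness, damping_ratio=1.0, mass=1.0,
                  criterion=0.02, dt=1e-4, max_time=2.0):
    """Return the last time (s) at which a movement starting at rest, one
    normalized unit away from its target, is still outside the criterion
    band around the target."""
    omega = math.sqrt(stiffness / mass)            # natural frequency
    damping = 2.0 * damping_ratio * mass * omega
    z, v = 1.0, 0.0                                # target is at z = 0
    last_outside = 0.0
    for step in range(1, int(round(max_time / dt)) + 1):
        accel = (-damping * v - stiffness * z) / mass
        v += accel * dt
        z += v * dt
        if abs(z) > criterion:
            last_outside = step * dt
    return last_outside


# Hypothetical stiffness values: quadrupling the stiffness roughly halves
# the settling time of a critically damped gesture.
print(settling_time(400.0), settling_time(1600.0))
```

Because the distance is normalized, the result depends only on the inertia, stiffness, and damping values, which is the sense in which settling time is implicit in the dynamics rather than represented explicitly.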

Intervals of active gestural control are specified at 
the intergestural level (Figure 2, top) with respect 
to the system's activation variables. The set of 
activation variables defines a third coordinate 
system in the present model, in addition to those 
defined by the tract variables and model 
articulators (see Figures 2 & 5). Each distinct 
tract-variable gesture is associated with its own 
activation variable, $a_{ik}$, where the subscript-i
denotes numerically the associated tract variable
(i = 1, ..., m), and the subscript-k denotes
symbolically the particular gesture's linguistic
affiliation (k = /p/, /i/, etc.). The value of $a_{ik}$ can be
interpreted as the strength with which the
associated tract-variable dynamical system
"attempts" to shape vocal tract movements at any
given point in time.
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Figure 5. Example of the "anatomical" connectivity pattern defined among the model's three coordinate systems. BL
and TD denote tract-variables associated with bilabial and tongue-dorsum constrictions, respectively.



In current simulations the temporal patterning
of gestural activity is accomplished with reference
to a gestural score (Figure 6) that represents the
activation of gestural primitives over time across
parallel tract-variable output channels. Currently,
these activation patterns are not derived from an
underlying implicit dynamics. Rather, these
patterns are specified explicitly "by hand," or are
derived according to a rule-based synthesis
program called GEST that accepts phonetic string
inputs and generates gestural score outputs
(Browman et al., 1986). In the gestural score for a
given utterance, the corresponding set of
activation functions is specified as an explicit
matrix function of time, A(t). For purposes of
simplicity, the activation interval of a given
gesture-ik is specified according to the duration of
a step-rectangular pulse normalized to unit
height ($a_{ik} \in \{0, 1\}$). In future developments of the
task-dynamic model (see the Serial Dynamics
section later in the article), we plan to generalize
the shapes of the activation waves and to allow
activations to vary continuously over the interval
($0 \le a_{ik} \le 1$).
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Figure 6. Gestural score used to synthesize the sequence /pʌb/. Filled boxes denote intervals of gestural activation. Box
heights are uniformly either 0 (no activation) or 1 (full activation). The waveform lines denote tract-variable
trajectories generated during the simulation.
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COPRODUCTION

Temporally discrete or isolated gestures are at 
best rare exceptions to the rule of temporal 
interleaving and overlap (coproduction) among 
gestures associated with nearby segments (e.g., 
Bell-Berti & Harris, 1981; Fowler, 1977, 1980; 
Harris, 1984; Keating, 1985; Kent & Minifie, 1977; 
Ohman, 1966, 1967; Perkell, 1969; Sussman et al., 
1973; for an historical review, see Hardcastle, 
1981). This interleaving is the source of the
ubiquitous phenomenon of coarticulation in
speech production. Coarticulation refers to the
fact that at any given point during an utterance,
the influences of gestures associated with several
adjacent or near-adjacent segments can generally
be discerned in acoustic or articulatory
measurements. Coarticulatory effects can occur,
for example, when lip protrusion for a following
rounded vowel begins during the preceding
phonologically unrounded consonant, thereby
coloring the acoustic correlates of the consonant
with those of the following vowel. Similarly, in a
vowel-/p/-vowel sequence the formation of the
bilabial closure for /p/ (using the jaw and lips)
appears to be influenced by temporally
overlapping demands associated with the
following vowel (using the jaw and tongue) by
virtue of the shared jaw component. In the
context of the present model (see also Coker, 1976;
Henke, 1966), these overlapping demands can be
represented as overlapping activation patterns in
a corresponding set of gestural scores. The
specification of gestural scores (either by hand or
by synthesis rules) thereby allows rigorous 
experimental control over the temporal onsets and 
offsets of the activations of simulated gestures, 
and provides a powerful computational framework 
and research tool for exploring and testing 
hypotheses derived from current work in 
experimental phonology/phonetics. In particular, 
these methods have facilitated the exploration of 
coarticulatory phenomena that have been ascribed 
to the effects of partial overlap or coproduction of 
speech gestures. We now describe in detail how 
gestural activation is incorporated into ongoing 
control processes in the model, and the effects of 
coproduction in shaping articulatory movement 
patterns. 
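
As a purely illustrative sketch of how overlapping activation patterns can be represented, the Python code below stores a gestural score as a list of step-rectangular pulses and reports which gestures are coproduced at a given time. The data layout, the function names, the tract-variable labels, and all onset and offset times are hypothetical; they are not the format used by GEST or by the model.

```python
def activation(score, t):
    """Return {(tract_variable, gesture): a_ik} at time t.

    Each activation is a unit-height rectangular pulse: 1.0 inside the
    gesture's activation interval and 0.0 outside it.
    """
    return {(tv, g): 1.0 if onset <= t < offset else 0.0
            for tv, g, onset, offset in score}


def coproduced(score, t):
    """List the gestures whose activation intervals contain time t."""
    return [key for key, a in activation(score, t).items() if a > 0.0]


# Hypothetical score for a vowel-/p/-vowel sequence (invented times in ms).
# Entries are (tract variable, gesture label, onset, offset).
score = [
    ("TDCD", "V1", 0, 300),
    ("LA", "p", 200, 350),
    ("TDCD", "V2", 250, 600),
]

print(coproduced(score, 100))   # a single active gesture
print(coproduced(score, 275))   # temporal overlap across tract variables
```

Shifting the onset and offset values is all that is needed to explore different degrees of temporal overlap among the simulated gestures.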

Active gestural control: Tuning and gating 

How might a gesture gain control of the vocal 
tract? In the present model, when a given 
gesture's activation is maximal (arbitrarily
defined as 1.0), the gesture exerts maximal
influence on all the articulatory components
associated with the gesture's tract-variable set.
During each such activation interval, the evolving
configuration of the model articulators results
from the gesturally and posturally specific way
that driving influences generated in the tract-variable
space (Equation [A1], Appendix 1) are
distributed across the associated sets of
articulatory components (Equations [A3] and [A4],
Appendix 2) during the course of the movement.
Conversely, when the gesture's activation is
minimal (arbitrarily defined as 0.0), none of the
articulators are subject to active control influences
from that gesture. What, then, happens to the
model articulators when there is no active control?
We begin by considering the former question of
active control, and treat the latter issue of
nonactive control below in the section Nonactive
Gestural Control.

The driving influences associated with a given
gesture's activation "wave" (see Figure 6) are
inserted into the interarticulator dynamical
system in two ways in our current simulations.
The first way serves to define or tune the current
set of dynamic parameter values in the model (i.e.,
K, B, $z_0$, and W in Equations [A3] and [A4],
Appendix 2; see also Saltzman & Kelso, 1983,
1987 for a related discussion of parameter tuning
in the context of skilled limb actions). The second
way serves to implement or gate the current
pattern of tract-variable driving influences into
the appropriate set of articulatory components.
The current use of tuning and gating is similar to
Bullock and Grossberg's (1988a, 1988b; see also
Cohen, Grossberg, & Stork, 1988; Grossberg, 1978)
use of target specification and "GO signals,"
respectively, in their model of sensorimotor
control.

The details of tuning and gating processes 
depend on the ongoing patterns of overlap that 
exist among the gestures in a given utterance. The 
gestural score in Figure 6 captures in a 
descriptive sense both the temporal overlap of 
speech gestures as well as a related spatial type of 
overlap. As suggested in the figure, coproduction 
occurs whenever the activations of two or more 
gestures overlap partially (or wholly) in time 
within and/or across tract-variables. Spatial 
overlap occurs whenever two or more coproduced 
gestures share some or all of their articulatory 
components. In these cases, the influences of the
spatially and temporally overlapping gestures are
said to be blended. For example, in a vowel-consonant-vowel
(VCV) sequence, if one assumes
that the activation intervals of the vowel and
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consonant gestures overlap temporally (e.g.,
Ohman, 1967; Sussman et al., 1973), then one can
define a continuum of supralaryngeal overlap that
is a function of the gestural identity of the medial
consonant. In such sequences, the flanking vowel
gestures are defined, by hypothesis, along the
tongue-dorsum tract variables and the associated
articulatory set of jaw and tongue body. If the
consonant is /h/, then there is no supralaryngeal
overlap. If the consonant is /b/, then its gesture is
defined along the bilabial tract variables and the
associated lips-jaw articulator set. Spatial overlap
occurs in this case at the shared jaw. If the
consonant is the alveolar /d/, its gesture is defined
along the tongue-tip tract variables and the
associated articulator set of tongue tip, tongue
body, and jaw. Spatial overlap occurs then at the
shared jaw and tongue body. Note that in both the
bilabial and alveolar instances the spatial overlap
is not total, and there is at least one articulator
free to vary, adaptively and flexibly, in ways
specific to its associated consonant. Thus, Ohman
(1967) showed, for a medial alveolar in a VCV
sequence, that both the location and degree of
tongue-tip constriction were unaffected by the
identity of the flanking vowels, although the
tongue-dorsum's position was altered in a vowel-specific
manner. Finally, if the medial consonant
in a VCV sequence is the velar /g/, the consonant
gesture is defined along exactly the same set of
tract variables and articulators as the flanking
vowels. In this case, there is total spatial overlap,
and the system shows a loss of behavioral
flexibility. That is, there is now contextual
variation evident even in the attainment of the
consonant's tongue-dorsum constriction target;
Ohman (1967), for example, showed that in such
cases the velar's place of constriction was altered
by the flanking vowels, although the constriction
degree was unaffected.

Blending due to spatial and temporal overlap 
occurs in the model as a function of the manner in
which the current gestural activation matrix, A, is
incorporated into the interarticulator dynamical
system. Thus, blending is implemented with
respect to both the gestural parameter set
(tuning) and the transformation from tract-variable
to articulator coordinates (gating)
represented in Equations (A3) and (A4). In the 
following paragraphs, we describe first the 
computational implementation of these activation 
and blending processes, and then describe the 
results of several simulations that demonstrate 
their utility. 



Parameter tuning. Each distinct simulated
gesture is linked to a particular subset of tract-variable
and articulator coordinates, and has
associated with it a set of time-invariant
parameters that are likewise linked to these
coordinate systems. For example, a tongue-dorsum
gesture's stiffness, damping, and target
parameters are associated with the tract variables 
of tongue-dorsum constriction location and degree; 
its articulator weighting parameters are 
associated with the jaw angle, tongue-body radial, 
and tongue-body angular degrees of freedom. 
Values for these parameters are estimated from 
kinematic speech data obtained by optoelectronic 
or X-ray measurements (e.g., Kelso et al., 1985; 
Smith, Browman, & McGowan, 1988; Vatikiotis- 
Bateson, 1988). The parameter set for a given 
gesture is represented as:

$$\{k_{ik},\ b_{ik},\ z_{0ik},\ w_{ikj}\}$$

where the subscripts denote numerically either
tract variables (i = 1, ..., m) or articulators
(j = 1, ..., n), or denote symbolically the particular
gesture's linguistic affiliation (k = /p/, /i/, etc.).
These parameter sets are incorporated into the
interarticulator dynamical system (see Equations
[A3] and [A4], Appendix 2) as functions of the
current gestural activation matrix, A, according to
explicit algebraic blending rules. These rules
define or tune the current values for the
corresponding components of the vector $z_0$ and
matrices K, B, and W in Equations (A3) and (A4)
as follows:

$$k_{ii} = \sum_{k \in Z_i} p_{Tik}\, k_{ik} \qquad (1a)$$

$$b_{ii} = \sum_{k \in Z_i} p_{Tik}\, b_{ik} \qquad (1b)$$

$$z_{0i} = \sum_{k \in Z_i} p_{Tik}\, z_{0ik} \qquad (1c)$$

and

$$w_{jj} = g_{Njj} + \sum_{i \in \Phi_j} \sum_{k \in Z_i} p_{Wikj}\, w_{ikj} \qquad (1d)$$

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where $p_{Tik}$ and $p_{Wikj}$ are variables denoting the
post-blending strengths of gestures whose
activations influence the ongoing values of tract-variable
and articulatory-weighting parameters,
respectively, in Equations A1 - A4 (Appendices 1
and 2); $Z_i$ is the set of gestures associated with the
$i$th tract-variable; $\Phi_j$ is the set of tract-variables
associated with the $j$th model articulator
(see Figure 4); and
$g_{Njj} = 1.0 - \min\left[1,\ \sum_{i \in \Phi_j} \sum_{k \in Z_i} a_{ik}\right]$.
The subscript N denotes
the fact that, for parsimony's sake, $g_{Njj}$ are the
same elements used to gate in the $j$th articulatory
component of the neutral attractor (see the
Nonactive Gestural Control section to follow). In
Equations (1a-1c), the parameters of the $i$th tract
variable assume default values of zero at times
when there are no active gestures that involve
this tract-variable. Similarly, in Equation (1d), the
articulatory weighting parameter of the $j$th
articulator assumes a default value of 1.0, due to
the contribution of the $g_{Njj}$ term, at times when
there are no active gestures involving this
articulator.

The $p_{Tik}$ and $p_{Wikj}$ terms in Equation 1 are
given by the steady-state solutions of a set of
feedforward, competitive-interaction-network
dynamical equations (see Appendix 3 for details).
These solutions are expressed as follows:

$$p_{Tik} = \frac{a_{ik}}{1 + \beta_{ik} \sum_{l \in Z_i,\ l \neq k} [\alpha_{il}\, a_{il}]} \qquad (2a)$$

[Equation (2b), the corresponding expression for $p_{Wikj}$, is not legible in the source.] (2b)

where $\alpha_{il}$ = competitive interaction (lateral
inhibition) coefficient from gesture-il to gesture-ik,
for l ≠ k; and $\beta_{ik}$ = a "gatekeeper" coefficient that
modulates the incoming lateral inhibition
influences impinging on gesture-ik from gesture-il,
for l ≠ k. For parsimony, $\beta_{ik}$ is constrained to
equal $1.0/\alpha_{ik}$ for $\alpha_{ik} \neq 0.0$. If $\alpha_{ik} = 0.0$, $\beta_{ik}$ is set to
equal 0.0 by convention. Implementing the
blended parameters defined by Equations (1a-1c)
into the dynamical system defined by Equations
(A3) and (A4) creates an attractor layout or field of
driving influences in tract-variable space that is
specific to the set of currently active gestures. The
blended parameters defined by Equation (1d)
create a corresponding pattern of relative
"receptivities" to these driving influences among
the associated synergistic articulators in the
coordinative structure.

Using the blending rules provided by Equations
(1) and (2), different forms of blending can be
specified, for example, among a set of temporally
overlapping gestures defined within the same
tract variables. The form of blending depends on
the relative sizes of the context-independent
(time-invariant) $\alpha$ and $\beta$ parameters associated
with each gesture. For $a_{ik} \in \{0, 1\}$, three
possibilities are averaging, suppressing, and
adding. For the set of currently active gestures
along the $i$th tract variable, if all $\alpha$'s are equal and
greater than zero (all $\beta$'s are then equal by
constraint), then the $\sum_{k \in Z_i} p_{Tik}$ is normalized to
equal 1.0 and the tract-variable parameters blend
by simple averaging. If the $\alpha$'s are unequal and
greater than zero, then the $\sum_{k \in Z_i} p_{Tik}$ is also
normalized to equal 1.0 and the parameters blend
by a weighted averaging. For example, if
gesture-ik's $\alpha_{ik}$ = 10.0 and gesture-il's $\alpha_{il}$ = 0.1,
then gesture-ik's parameter values dominate or
"suppress" gesture-il's parameter values in the
blending process when both gestures are co-active.
Finally, if all $\alpha$'s = 0.0, then all $\beta$'s = 0.0 by
convention, and the parameters in Equation (1)
blend by simple addition. Currently, all gestural
parameters in Equation (1) are subject to the
same form of competitive blending. It would be
possible at some point, however, to implement
different blending forms for the different
parameters, e.g., adding for targets and averaging
for stiffnesses, as suggested by recent data on lip
protrusion (Boyce, 1988) and laryngeal abduction
(Munhall & Lofqvist, 1987).
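
The three forms of blending can be checked with a few lines of Python. The sketch below assumes that the post-blending strength of each gesture on a single tract variable has the form p_k = a_k / (1 + beta_k * sum over l ≠ k of alpha_l * a_l), with beta_k = 1/alpha_k when alpha_k is nonzero and beta_k = 0 otherwise. This form is an assumption chosen to be consistent with the averaging, suppressing, and adding cases described above; Appendix 3 should be consulted for the authoritative equations. The function names and example parameter values are hypothetical.

```python
def blending_strengths(activations, alphas):
    """Post-blending strengths for gestures sharing one tract variable.

    Assumed form: p_k = a_k / (1 + beta_k * sum_{l != k} alpha_l * a_l),
    with beta_k = 1/alpha_k for alpha_k != 0 and beta_k = 0 otherwise.
    """
    betas = [1.0 / alpha if alpha != 0.0 else 0.0 for alpha in alphas]
    strengths = []
    for k, a_k in enumerate(activations):
        inhibition = sum(alphas[l] * activations[l]
                         for l in range(len(activations)) if l != k)
        strengths.append(a_k / (1.0 + betas[k] * inhibition))
    return strengths


def blend(values, activations, alphas):
    """Blend one parameter (e.g., a target) across co-active gestures."""
    strengths = blending_strengths(activations, alphas)
    return sum(p * v for p, v in zip(strengths, values))


targets = [2.0, 4.0]        # hypothetical targets of two co-active gestures
both_active = [1.0, 1.0]

print(blend(targets, both_active, alphas=[1.0, 1.0]))    # averaging
print(blend(targets, both_active, alphas=[10.0, 0.1]))   # first suppresses second
print(blend(targets, both_active, alphas=[0.0, 0.0]))    # adding
```

With equal nonzero coefficients the strengths are equal and sum to one, with very unequal coefficients the blended value lies close to the dominant gesture's value, and with zero coefficients the values simply add.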

Transformation gating. The tract-variable
driving influences shaped by Equations (1) and (2)
remain implicit and "disconnected" from the
receptive model articulators until these influences
are gated explicitly into the articulatory degrees of
freedom. This gating occurs with respect to the
weighted Jacobian pseudoinverse (i.e., the
transformation that relates tract-variable motions
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to articulatory motions) and its associated
orthogonal projection operator (see Appendix 2).
Specifically, $J^*$ and $I_n$ are replaced in Equation
(A4) by gated forms, $J^*_G$ and $G_P$, respectively. $J^*_G$
can be expressed as follows:

$$J^*_G = W^{-1} J_G^T \left( C + [I_m - G_A] \right)^{-1} \qquad (3a)$$

where $J_G = G_A J$, and $G_A$ is a diagonal m × m
gating matrix for the active tract-variable
gestures. Each $g_{Aii} = \min\left(1,\ \sum_{k \in Z_i} a_{ik}\right)$, where the
summation is defined as in Equations (1) and (2).
Each $g_{Aii}$ multiplies the $i$th row of the Jacobian.
This row relates motions of the articulators to
motions defined along the $i$th tract variable (see
Equation [A2]). When $g_{Aii}$ = 1 (or 0), the $i$th tract
variable is gated into (out of) $J^*_G$ and contributes
(does not contribute) to $\ddot{\phi}_A$, the vector of active
articulatory driving influences (see Equations [A3]
and [A4]); $C = J_G W^{-1} J_G^T$. C embodies the
kinematic interrelationships that exist among the
currently active set of tract variables. Specifically,
C is defined by the set of weighted, pairwise inner
products of the gated Jacobian rows. A diagonal
element of C, $c_{ii}$, is the weighted inner product
(sum of squares) of the $i$th gated Jacobian row
with itself; an off-diagonal element, $c_{hi}$ (h ≠ i), is
the weighted inner product (sum of products) of
the $h$th and $i$th gated Jacobian rows. A pair of
gated Jacobian rows has a (generally) nonzero
weighted inner product when the corresponding
tract variables are active and share some or all
articulators in common; the weighted inner
product of two gated Jacobian rows equals zero
when the corresponding tract variables are active
and share no articulators; the inner product also
equals zero when one or both rows correspond to a
nonactive tract variable; and $I_m$ = an m × m identity
matrix.

The gated orthogonal projection operator is
expressed as follows:

$$[G_P - J^*_G J] \qquad (3b)$$

where $G_P$ = a diagonal n × n gating matrix.
Each element
$g_{Pjj} = \min\left[1,\ \sum_{i \in \Phi_j} \sum_{k \in Z_i} a_{ik}\right]$,
where the summations are defined as in
Equations (1) and (2).

For example, if there are no active gestures then
$G_A = 0$, $G_P = 0$, and $(C + [I_m - G_A]) = I_m$.
Consequently, both $J^*_G$ and the gated orthogonal
projection operator equal zero, and $\ddot{\phi}_A = 0$
according to Equation (A4). If active gestures
occur simultaneously in all tract variables, then
$G_A = I_m$, $G_P = I_n$, and $J^*_G = J^*$. That is, both the
gated pseudoinverse and orthogonal projection
operator are "full blown" when all tract variables
are active, and $\ddot{\phi}_A$ is influenced by the attractor
layouts and corresponding driving influences
defined over all the tract variables. If only a few of
the tract variables are active, these terms are not
full blown and $\ddot{\phi}_A$ is only subject to driving
influences associated with the attractor layout in
the subspace of active tract variables.
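
The limiting cases just described can be verified on a small example. The Python sketch below computes a gated pseudoinverse of the assumed form W^-1 J_G^T (C + [I_m - G_A])^-1, with J_G = G_A J and C = J_G W^-1 J_G^T, for two tract variables and three articulators. The toy Jacobian (one "lip aperture" row using lips and jaw, one "tongue-dorsum" row using jaw and tongue), the unit weights, and the function name are hypothetical simplifications introduced only for this illustration.

```python
def gated_pseudoinverse(jacobian, weights, gates):
    """Gated weighted pseudoinverse for exactly two tract variables.

    jacobian: 2 rows (tract variables) by n columns (articulators)
    weights:  the n diagonal elements of W
    gates:    the 2 diagonal elements of G_A (each 0.0 or 1.0)
    Returns an n-by-2 matrix as a list of rows.
    """
    n = len(weights)
    # Gate the Jacobian rows: J_G = G_A J
    jg = [[gates[i] * jacobian[i][j] for j in range(n)] for i in range(2)]
    # C = J_G W^-1 J_G^T: weighted inner products of the gated rows
    c = [[sum(jg[h][j] * jg[i][j] / weights[j] for j in range(n))
          for i in range(2)] for h in range(2)]
    # M = C + (I_m - G_A), inverted with the explicit 2-by-2 formula
    m00 = c[0][0] + (1.0 - gates[0])
    m11 = c[1][1] + (1.0 - gates[1])
    m01, m10 = c[0][1], c[1][0]
    det = m00 * m11 - m01 * m10
    inv = [[m11 / det, -m01 / det], [-m10 / det, m00 / det]]
    # W^-1 J_G^T M^-1
    return [[sum(jg[h][j] / weights[j] * inv[h][i] for h in range(2))
             for i in range(2)] for j in range(n)]


# Hypothetical kinematics. Columns: lips, jaw, tongue.
JACOBIAN = [[1.0, 1.0, 0.0],    # lip-aperture row
            [0.0, 1.0, 1.0]]    # tongue-dorsum row
WEIGHTS = [1.0, 1.0, 1.0]

print(gated_pseudoinverse(JACOBIAN, WEIGHTS, [1.0, 1.0]))  # all active
print(gated_pseudoinverse(JACOBIAN, WEIGHTS, [1.0, 0.0]))  # bilabial only
print(gated_pseudoinverse(JACOBIAN, WEIGHTS, [0.0, 0.0]))  # none active
```

When both gates are open the result is the ordinary pseudoinverse, whose off-diagonal terms arise from the shared jaw; when one gate is closed the corresponding column is zero; and when both are closed the whole matrix vanishes.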

Nonactive gestural control: The neutral 
attractor 

We return now to the question of what happens 
to the model articulators when there is no active 
control in the model. In such cases, articulator 
movements are shaped by a default neutral
attractor. The neutral attractor is a point attractor
in model articulator space, whose target
configuration corresponds to schwa /ə/ in current
modeling. It is possible, however, that this neutral
target may be language-specific. The articulatory
degrees of freedom in the neutral attractor are
uncoupled dynamically, i.e., point attractor
dynamics are defined independently for each
articulator. At any given point in time, the
neutral attractor exerts a set of driving influences
on the articulators, $\ddot{\phi}_N$, that can be expressed as
follows:

$$\ddot{\phi}_N = G_N \left( -B_N \dot{\phi} - K_N [\phi - \phi_{N0}] \right) \qquad (4)$$

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where $\phi_{N0}$ is the neutral target configuration; $B_N$
and $K_N$ are n × n diagonal damping and stiffness
matrices, respectively. Because their parameters
never change, parameter tuning or blending is not
defined for the neutral attractor. The components,
$k_{Njj}$, of $K_N$ are typically defined to be equal,
although they may be defined asymmetrically to
reflect hypothesized differences in the
biomechanical time constants of the articulators
(e.g., the jaw is more sluggish [has a larger time
constant] than the tongue tip). The components,
$b_{Njj}$, of $B_N$ are defined at present relative to the
corresponding $K_N$ components to provide critical
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damping for each articulator's neutral point
attractor. For parsimony's sake, $B_N$ is also used to
define the orthogonal projection vector in
Equation (A4); and $G_N$ is an n × n diagonal gating
matrix for the neutral attractor. Each element
$g_{Njj} = 1.0 - \min\left[1,\ \sum_{i \in \Phi_j} \sum_{k \in Z_i} a_{ik}\right]$, where
the summations are defined as in Equations (1-3).
Note that $G_N = I_n - G_P$, where $I_n$ is the n × n
identity matrix and $G_P$ is defined in Equation (3b).

The total set of driving influences ($\ddot{\phi}_T$) on the
articulators at any given point in time is the sum
of an active component ($\ddot{\phi}_A$; see Equations [A3],
[A4], and [3]) and a neutral component ($\ddot{\phi}_N$; see
Equation [4]), and is defined as follows:

$$\ddot{\phi}_T = \ddot{\phi}_A + \ddot{\phi}_N \qquad (5)$$



For example, consider a time when only a tongue-dorsum
gesture is active. Then $g_{N11}$ (for LH) =
$g_{N33}$ (for ULV) = $g_{N44}$ (for LLV) = 1.0, and $g_{N22}$
(for JA) = $g_{N55}$ (for TBR) = $g_{N66}$ (for TBA) = 0.
Active control will exist for the gesturally involved
jaw and tongue (JA, TBR, and TBA), but the
noninvolved lips (LH, ULV, and LLV) will "relax"
independently from their current positions toward
their neutral positions according to the specified
time constants. If only a bilabial gesture is active,
then the complementary situation holds, with
$g_{N11} = g_{N22} = g_{N33} = g_{N44} = 0$ and $g_{N55} = g_{N66} = 1.0$.
The jaw and lips will be actively controlled,
and the tongue will relax toward its neutral
configuration. When both bilabial and tongue-dorsum
gestures are active simultaneously, all
$g_{Njj}$ components equal zero, $G_N = 0$, and the
neutral attractor has no influence on the
articulatory movement patterns. When there is no
active control, all $g_{Njj}$ components equal one, and
all articulators relax toward their neutral targets.
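
The gating pattern in these examples can be reproduced with a short Python sketch. For simplicity the code works at the level of constriction types rather than individual tract variables: an articulator's neutral gate is taken to be 1.0 minus the (clipped) total activation of the gestures whose articulator set includes it. The articulator sets follow the text; the dictionary layout, the function names, and the stiffness value are illustrative assumptions.

```python
ARTICULATORS = ["LH", "JA", "ULV", "LLV", "TBR", "TBA"]

# Articulator sets for the two constriction types, as listed in the text.
ARTICULATOR_SETS = {
    "bilabial": {"LH", "JA", "ULV", "LLV"},
    "tongue_dorsum": {"JA", "TBR", "TBA"},
}


def neutral_gates(active):
    """Return the neutral-attractor gate for each articulator.

    active maps a constriction type to its summed gestural activation.
    """
    gates = {}
    for articulator in ARTICULATORS:
        total = sum(a for name, a in active.items()
                    if articulator in ARTICULATOR_SETS[name])
        gates[articulator] = 1.0 - min(1.0, total)
    return gates


def neutral_drive(position, velocity, neutral_target, gate, stiffness=100.0):
    """Gated, critically damped pull toward the neutral target
    (unit mass and a hypothetical stiffness value are assumed)."""
    damping = 2.0 * stiffness ** 0.5
    return gate * (-damping * velocity
                   - stiffness * (position - neutral_target))


print(neutral_gates({"tongue_dorsum": 1.0}))   # lips relax, jaw and tongue do not
print(neutral_gates({"bilabial": 1.0}))        # tongue relaxes
print(neutral_gates({}))                       # no active control
```

An articulator with a gate of zero receives no neutral driving influence, so it is left entirely to the active gestures that involve it.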

Simulation examples 

We now describe results from several 
simulations that demonstrate how active and 
neutral control influences are implemented in the 
model, focusing on instances of gestural
coproduction. 

Parameter tuning. The form of parameter
blending for speech production has been
hypothesized to be tract-variable-specific
(Saltzman et al., 1987). As already discussed in
the Active Gestural Control section, Ohman (1967)
showed that for VCV sequences, when the medial
consonant was a velar (/g/ or /k/), the surrounding
vowels appeared to shift the velar's place of
constriction but not its degree. These results have
been simulated (qualitatively, at least) by
superimposing temporally the activation intervals
for the medial velar consonant and the flanking
vowels. During the resultant period of consonant-vowel
coproduction, an averaging blend was
implemented for tuning the tract-variable
parameters of tongue-dorsum-constriction-location,
and a suppressing blend (velar
suppresses vowel) was implemented for
tongue-dorsum-constriction-degree (see Figure 7;
Saltzman et al., 1987; cf. Coker, 1976, for an
alternative method of generating similar results).
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This blending scheme for constriction degree is
consistent with the assumption in current
modeling that the amount of suppression during
blending is related to differences in the sonority
(Jespersen, 1914) or openness of the vocal tract
associated with each of the blended gestures.
Gestural sonority is reflected in the constriction
degree target parameters of each gesture. For
tongue-dorsum gestures, vowels have large
positive-valued targets for constriction degree that
reflect their open tract shapes (high sonority), and
stops have small negative target values that
reflect contact-plus-compression against the upper
tract wall (low sonority).

Transformation gating. Simulations
described in this article of blending for gestures
defined along different tract variables have been
restricted to periods of temporal overlap between
pairs of bilabial and tongue-dorsum gestures.
Under these circumstances, articulatory trajectories
have been generated for sequences
involving consonantal gestures superposed onto
ongoing vocalic gestures that match (qualitatively,
at least) the trajectories observed in X-ray data. In
particular, Tiede and Browman (1988) analyzed
X-ray data that included the vertical motions of
pellets placed on the lower lip, lower incisor (i.e.,
jaw), and "mid-tongue" surface during /pV1pV2p/
sequences. The mid-tongue pellet height
corresponds, roughly, to tongue-dorsum height in
the current model. Tiede and Browman found that
the mid-tongue pellet moved with a relatively
smooth trajectory from its position at the onset of
the first vowel to its position near the offset of the
second vowel. Specifically, when V1 was the
medium height vowel /e/ and V2 was the low vowel
/a/, the mid-tongue pellet showed a smooth
lowering trajectory over this gestural time span
(see Figure 8). During this same interval, the jaw
and lower lip pellets moved through comparably
smooth gestural sequences of lowering for V1,
raising for the medial /p/, and lowering for V2.
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Figure 8. Acoustic waveform and vertical components of articulatory X-ray pellet data during the utterance /pepap/.
(From Tiede & Browman, 1988; used with authors' permission.)
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Figure 9a shows a simulation with the current
model of a similar sequence /əbæbæ/. The main
point is that the vowel-to-vowel trajectory for
tongue-dorsum-constriction-degree is smooth,
going from the initial schwa to the more open /æ/.
This tongue-dorsum pattern occurs simultaneously
with the comparably smooth closing-opening
gestural sequences for jaw height and lip
aperture.

Two earlier versions of the present model 
generated nonacceptable tngectories for this same 
sequence that are instructive concerning the 
model's functioning. In one version (the 'modular^ 
model), each constriction type operated 
independently of the other during periods of 
coproduction. For example, during periods of 
bilabial and tongue-dorsum overlap, driving 
imiuences were generated along the tract- 
variables associated nlth eadi constriction. These 
influences were then transformed into articulatory 
driving influences by separate, constriction- 
specific Jacobian pseudoinverses (e.g., see 
Equations [A3] and [A4]). The bilabial
pseudoinverse involved only the Jacobian rows
(see Equation [A2] and Figure 4) for lip aperture
and protrusion, and the tongue-dorsum
pseudoinverse involved only the Jacobian rows for
tongue-dorsum constriction location and degree.
The articulatory driving influences associated
with each constriction were simply averaged at
the articulatory level for the shared jaw.
The results are shown in Figure 9b, where it is 
evident that the tongue-dorsum does not display 
the relatively smooth vowel-to-vowel trajectory 
seen in the X-ray data and with the current 
model. Rather, the trajectory appears to be
perturbed in a complex manner by the
simultaneous jaw and lip aperture motions. It is
hypothesized that these perturbations are due to
the fact that the modular model did not 
incorporate, by definition, the off-diagonal 
elements of the (T-matrix used currently in the 
gated pseudoinverse (Equation [3]). Recall that 
these elements reflected the kinematic 
relationships that exist among different, 
concurrently active tract-variables by virtue of 
shared articulators. In the modular model, these 
terms were absent because the constriction- 
specific pseudoinverses were defined explicitly to 
be independent of each other. Thus, if the current 
model is a reasonable one, it tells us that 
knowledge of inter-tract-variable kinematic 
relationships must be embodied in the control and 
coordinative processes for speech production.

A different type of failure by a second earlier
version of the model provides additional 
constraints on the form that must be taken by 
such inter-tract-variable knowledge. En route to 
developing the current model, a mistake was 
made that generated a perfectly flat jaw trajectory 
(the "flat jaw" model) for the same sequence
(/əbæbæ/; see Figure 9c). Interestingly, however,
the tongue-dorsum trajectory was virtually 
identical to that generated with the current 
model. The reason for this anomalous jaw 
behavior was that the gated pseudoinverse 
(Equation [3]) had been forced accidentally to be 
"full blown" regardless of the ongoing state of
gestural activation. This meant that all tract
variables were gated on in this transformation,
even when the associated gestures were not 
activated. The specification of the attractor layout 
at the tract-variable level, however, worked as it 
does in the current model. Active gestures "create"
point attractors in the control landscape for the
associated tract variables. In this landscape, the
currently active target can be considered to lie at
the bottom of a valley whose walls are slightly
sticky. The resultant tract-variable motion is to
slide stably down the valley wall from its current 
position toward the target, due to the nonzero 
driving influences associated with the system's 
attraction to the target position. Nonactive 
gestures, on the other hand, "create" only flat
tract-variable control landscapes, in which no
position is preferred over any other and the value
of the tract-variable driving influences equals
zero. Recall from the Gestural Primitives section
(Figures 3 and 4) that the model includes a
lower-tooth-height tract variable that maps one-to-one
onto jaw angle. For the sequence /əbæbæ/, this
tract variable is never active and, consequently, 
the corresponding component of the tract-variable 
driving influence vector is constantly equal to 
zero. When the gated pseudoinverse is full blown, 
this transformation embodies the kinematic 
relationships among the bilabial, tongue-dorsum, 
and lower-tooth-height tract variables that exist
by virtue of the shared jaw. This means that the 
transformation treats the zero driving component 
for lower-tooth-height as a value that should be 
passed on to the articulators, in conjunction with 
the driving influences from the bilabial and 
tongue-dorsum constrictions. As a result, the jaw 
receives zero active driving, and because the jaw 
starts off at its neutral position for the initial 
schwa, it also receives zero driving from the
neutral attractor (Equation [4]) throughout the
sequence. The result is the observed flat trajectory
for the jaw. Thus, if the current model is a
sensible one, this nonacceptable "flat jaw"
simulation tells us that the kinematic
interrelationships embodied in the system's
pseudoinverse at any given point in time must be
gated functions of the currently active gesture set.
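
The two failure cases can be illustrated with a deliberately simplified linear sketch. It is not the model's actual geometry: the three articulators (jaw, lip, tongue), the unit Jacobian entries, the unit articulator weights, and the names min_norm_solution, gated, modular, full_blown, and achieved are all assumptions made for illustration. Lip aperture and tongue-dorsum constriction degree each depend on the shared jaw plus one private articulator, and lower-tooth height depends on the jaw alone.

```python
# Toy linear vocal tract (illustrative assumption, not the model's geometry).
# Articulator order: [jaw, lip, tongue]. Each row is one tract variable.
LA = [1.0, 1.0, 0.0]    # lip aperture       = jaw + lip
TD = [1.0, 0.0, 1.0]    # tongue dorsum      = jaw + tongue
LTH = [1.0, 0.0, 0.0]   # lower-tooth height = jaw only


def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def min_norm_solution(J, z):
    """Articulator drive q = J^T (J J^T)^-1 z, assuming unit weights."""
    JJt = [[sum(x * y for x, y in zip(r1, r2)) for r2 in J] for r1 in J]
    lam = solve(JJt, z)
    return [sum(J[i][k] * lam[i] for i in range(len(J)))
            for k in range(len(J[0]))]


def gated(la_drive, td_drive):
    """Only the two active tract variables enter one joint pseudoinverse."""
    return min_norm_solution([LA, TD], [la_drive, td_drive])


def modular(la_drive, td_drive):
    """Separate pseudoinverse per constriction; shared jaw drive averaged."""
    q_la = min_norm_solution([LA], [la_drive])
    q_td = min_norm_solution([TD], [td_drive])
    return [(q_la[0] + q_td[0]) / 2.0, q_la[1], q_td[2]]


def full_blown(la_drive, td_drive):
    """Inactive lower-tooth height is gated on with a zero drive."""
    return min_norm_solution([LA, TD, LTH], [la_drive, td_drive, 0.0])


def achieved(q):
    """Tract-variable motion (LA, TD) produced by articulator drive q."""
    return [sum(j * x for j, x in zip(row, q)) for row in (LA, TD)]


for name, scheme in (("gated", gated), ("modular", modular),
                     ("full blown", full_blown)):
    q = scheme(-1.0, 0.5)   # lips closing while the tongue dorsum opens
    print(name, "jaw drive:", round(q[0], 3),
          "achieved (LA, TD):", [round(v, 3) for v in achieved(q)])
```

Under these assumptions the gated scheme reproduces both requested tract-variable drives, the modular scheme lets the lip drive leak into the tongue-dorsum motion through the averaged jaw, and the full-blown scheme reproduces both drives while leaving the jaw at rest. These are the qualitative patterns described for Figures 9a, 9b, and 9c, respectively.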

SERIAL DYNAMICS 

The task-dynamic model defines, in effect, a 
selective pattern of coupling among the 
articulators that is specific to the set of currently 
active gestures. This coupling pattern is shaped 
according to three factors: a) the current state of
the gestural activation matrix; b) the tract- 
variable parameter sets and articulator weights 
associated with the currently active gestures; and 
c) the geometry of the nonlinear kinematic
mapping between articulatory and tract-variable
coordinates (represented by J and in Equations
[A2] and [A3]) for all associated active gestures. 
The model provides an intrinsically dynamical 
account of multiarticulator coordination within 
the activation intervals of single (perturbed and 
unperturbed) gestures. It also holds promise for 
understanding the blending dynamics of 
coproduced gestures that share articulators in 
common. However, task-dynamics does not 
currently provide an intrinsically dynamic account 
of the intergestural timing patterns comprising 
even a simple speech sequence (see Figures 1 and 
6). At the level of phonologically defined segments, 
the sequence might be a repetitive alternation 
between a given vowel and consonant, e.g., 
/bababa.../. At a more fine-grained level of
description, the sequence might be a
"constellation" (Browman & Goldstein, 1986, in
press) of appropriately phased gestures, e.g., the
bilabial closing-opening and the laryngeal
opening-closing for word-initial /p/ in English. As
discussed earlier, current simulations rely on 
explicit gestural scores to provide the layout of 
activation intervals over time and tract variables 
for such utterances. 

The lack of an appropriate serial dynamics is a 
major shortcoming in our speech modeling to date.
This shortcoming is linked to the fact that the 
most-studied and best-understood dynamical 
systems in the nonlinear dynamics literature are 
those whose behaviors are governed by point 
attractors, periodic attractors (limit cycles), and 
strange attractors. (Strange attractors underlie 
the behaviors of chaotic dynamical systems, in 
which seemingly random movement patterns have 
deterministic origins; e.g., Ruelle, 1980). For 
nonrepetitive and nonrandom speech sequences,
such attractors appear clearly inadequate.
However, investigations in the computational
modeling of connectionist (parallel distributed
processing, neuromorphic, neural net) dynamical
systems have focused on the problem of sequence
control and the understanding of serial dynamics
(e.g., Grossberg, 1986; Jordan, 1986, in press;
Kleinfeld & Sompolinsky, 1988; Lapedes &
Farber, cited in Lapedes & Farber, 1986;
Pearlmutter, 1988; Rumelhart, Hinton, &
Williams, 1986; Stornetta, Hogg, & Huberman,
1988; Tank & Hopfield, 1987). Such dynamics
appear well-suited to the task of sequencing or
orchestrating the transitions in activation among
gestural primitives in a dynamical model of
speech production.

Intergestural timing: A connectionist
approach

Explaining how a movement sequence is 
generated in a connectionist computational 
network becomes primarily a matter of explaining 
the patterning of activity over time among the
network's processing elements or nodes. This 
patterning occurs through cooperative and 
competitive interactions among the nodes 
themselves. Each node can store only a small 
amount of information (typically only a few 
marker bits or a single scalar activity-level) and is 
capable of only a few simple arithmetic or logical 
actions. Consequently, the interactions are 
conducted, not through individual programs or 
symbol strings, but through very simple 
messages— signals limited to variations in 
strength. Such networks, in which the 
transmission of symbol strings between nodes is
minimal or nonexistent, depend for their success
on the availability and attunement of the right
connections among the nodes (e.g., Ballard, 1986;
Fahlman & Hinton, 1987; Feldman & Ballard,
1982; Grossberg, 1982; Rumelhart, Hinton, &
McClelland, 1986). The knowledge constraining
the performance of a serial activity, including 
coarticulatory patterning, is embodied in these 
connections rather than stored in specialized 
memory banks. That is, the structure and 
dynamics of the network govern the movement as
it evolves, and knowledge of the movement's time 
course never appears in an explicit, declarative 
form. 

In connectionist models, the plan for a sequence 
is static and timeless, and is identified with a set 
of input units. Output units in the network are 
assumed to represent the control elements of the 
movement components and to affect these
elements in direct proportion to the level of
output-unit activation. One means of producing 
temporal ordering is to (a) establish an activation- 
level gradient through lateral inhibition among 
the output units so that those referring to earlier 
aspects of the sequence are more active than those 
referring to later aspects; and (b) inhibit output 
units once a threshold value of activation is 
achieved (e.g., Grossberg, 1978). Such 
connectionist systems, however, have difficulty 
producing sequences in which movement
components are repeated (e.g., Rumelhart &
Norman, 1982). In fact, a general awkwardness in
dealing with the sequential control of network
activity has been acknowledged as a major
shortcoming of most current connectionist models
(e.g., Hopfield & Tank, 1986). Some promising
developments have been reported that address
such criticisms (e.g., Grossberg, 1986; Jordan,
1986, in press; Kleinfeld & Sompolinsky, 1988;
Lapedes & Farber, cited in Lapedes & Farber,
1986; Pearlmutter, 1988; Rumelhart, Hinton, &
Williams, 1986; Stornetta, Hogg, & Huberman,
1988; Tank & Hopfield, 1987). We now describe in 
detail one such development (Jordan, 1986, in 
press). 
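
The gradient-plus-inhibition scheme described above, and its difficulty with repeated elements, can be sketched as follows. This is a simplified reading of the idea rather than Grossberg's (1978) equations: the linear gradient, the rule that the most active unit is the one that fires, and the names plan_gradient and perform are assumptions made for illustration.

```python
def plan_gradient(sequence):
    """One unit per element type; earlier elements get higher activation."""
    n = len(sequence)
    gradient = {}
    for position, element in enumerate(sequence):
        # A type-level unit can hold only one activation value, so a
        # repeated element keeps the level set by its first occurrence.
        gradient.setdefault(element, (n - position) / n)
    return gradient


def perform(gradient):
    """Let the most active unit fire, then inhibit it, until none remain."""
    active = dict(gradient)
    produced = []
    while any(level > 0 for level in active.values()):
        winner = max(active, key=active.get)
        produced.append(winner)
        active[winner] = 0.0   # inhibition once the unit has fired
    return produced


print(perform(plan_gradient("pat")))   # no repeated elements
print(perform(plan_gradient("bab")))   # the second "b" cannot be encoded
```

In this sketch a sequence without repeats is recovered in order, whereas the repeated element in "bab" is produced only once, because a single activation level per unit cannot mark two serial positions.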

Serial dynamics: A representative model 

Jordan's (1986, in press) connectionist model of 
serial order can be used to define a time-invariant 
dynamical system with an intrinsic time scale
that spans the performance of a given output 
sequence. There are three levels in his model (see 
Figure 10). At the lowest level are output units. 
Even if a particular output unit is activated 
repeatedly in an intended sequence, it is
represented by only one unit. Thus, the model
adopts a type rather than token representation
scheme for sequence elements. In the context of 
the present article, a separate output unit would 
exist for each distinct gesture in a sequence. The 
tuning and gating consequences of gestural 
activation described earlier (see the Active 
Gestural Control section) are consistent with 
Jordan's suggestion that "the output of the
network is best thought of as influencing
articulator trajectories indirectly, by setting
parameters or providing boundary conditions for
lower level processes which have their own
inherent dynamics" (Jordan, 1986, p. 23). For
example, in the repetitive sequence /bababa.../,
there would be (as a first approximation) only two 
output units, even though each unit potentially 
could be activated an indefinite number of times 
as the sequence continues. In this example, the
output units define the activation coordinates for
the consonantal bilabial gesture and the vocalic
tongue-dorsum gesture, respectively. The values
of the output units are the activation values of the
associated gestures, and can vary continuously
across a range normalized from zero (the
associated gesture unit is inactive) to one (the
associated gesture unit is maximally active).

Figure 10. Basic network architecture for Jordan's (1986,
in press) connectionist model of serial order (not all
connections are shown). The plan units and their
connections (indicated in light gray) are not used in our
proposed hybrid model of the serial dynamics of speech
production (see text and footnote [8] for details).

At the highest level of Jordan's model are the 
state units that, roughly speaking, define among
themselves a dynamical flow with an intrinsic
time scale specific to the intended sequence. These
state-unit dynamics are defined by an equation of
motion (the next-state function) that is
implemented in the model by weighted recurrent
connections among the state units themselves,
and from the output units to the state units.
Finally, at an intermediate level of the model are
a set of hidden units. These units are connected to
both the state units and the output units by two
respective layers of weighted paths, thereby
defining a nonlinear mapping or output function
from state units to output units. The current 
vector of output activations is a function of the 
preceding state, which is itself a function of the 
previous state and previous output, and so on. 
Thus, the patterning over time of onsets and
offsets for the output units does not arise as a
consequence of direct connections among these
units. Rather, such relative timing is an emergent
property of the dynamics of the network as a
whole. Temporal ordering among the output
elements of a gestural sequence is an implicit
consequence of the network architecture (i.e., the
input-output functions of the system elements,
and the pattern of connections among these
elements) and the sequence-specific set of
constant values for the weights associated with 
each connection path.^ 
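
The flow of activity through the three levels can be sketched in a few lines. The sketch below is an assumption-laden illustration, not Jordan's implementation: the layer sizes, the logistic squashing function, the self-connection weight mu, the one-to-one output-to-state feedback with weight 1, and the arbitrary untrained weights are all hypothetical choices, as are the names layer and run_network.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_weights(n_rows, n_cols, rng):
    """Arbitrary untrained weights; a trained network would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)]
            for _ in range(n_rows)]


def layer(weights, inputs):
    """Weighted sum followed by a logistic squashing function."""
    return [logistic(sum(w * x for w, x in zip(row, inputs)))
            for row in weights]


def run_network(n_steps, n_units=2, n_hidden=3, mu=0.5, seed=0):
    """Return the output-unit activations over n_steps time steps.

    n_units is used for both the state and the output layer, because the
    feedback is assumed to be one-to-one (one unit per gesture type).
    """
    rng = random.Random(seed)
    w_state_hidden = make_weights(n_hidden, n_units, rng)
    w_hidden_output = make_weights(n_units, n_hidden, rng)
    state = [0.0] * n_units
    outputs = []
    for _ in range(n_steps):
        # Output function: state -> hidden -> output.
        output = layer(w_hidden_output, layer(w_state_hidden, state))
        outputs.append(output)
        # Next-state function: recurrent self-connection (weight mu) plus
        # feedback from the output units.
        state = [mu * s + o for s, o in zip(state, output)]
    return outputs


for step, activations in enumerate(run_network(5)):
    print(step, [round(a, 3) for a in activations])
```

With untrained weights the trajectories themselves are arbitrary. The point is only structural: no output unit is connected directly to another, yet each output depends on the entire preceding history through the state units.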

The network can "learn" a different set of weight
values for each intended utterance in Jordan's
(1986; in press) model, using a "teaching"
procedure that incorporates the generalized delta
rule (back propagation method) of Rumelhart,
Hinton, & Williams (1986). According to this rule,
error signals generated at the output units
(defined by the difference between the current
output vector and a "teaching" vector of desired
activation values) are projected back into the
network to allow the hidden units to change their
weights. The weights on each pathway are
changed in proportion to the size of the error being
back-propagated along these pathways, and error
signals for each hidden unit are computed by
adding the error signals arriving at these units.
Rumelhart, Hinton, & Williams (1986) showed
that this learning algorithm implements 
essentially a gradient search in weight space for 
the set of weights that allows the network to 
perform with a minimum sum of squared output 
errors.® 
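
A minimal sketch of one such weight update is given below for a network with a single hidden layer of logistic units. The network size, initial weights, learning rate, input vector, and teaching vector are arbitrary assumptions chosen for illustration, and the names forward, squared_error, and delta_rule_step are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_output, x):
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)))
              for row in w_hidden]
    output = [logistic(sum(w * h for w, h in zip(row, hidden)))
              for row in w_output]
    return hidden, output


def squared_error(w_hidden, w_output, x, target):
    _, output = forward(w_hidden, w_output, x)
    return sum((t - y) ** 2 for t, y in zip(target, output))


def delta_rule_step(w_hidden, w_output, x, target, rate=0.2):
    """One in-place weight update by the generalized delta rule."""
    hidden, output = forward(w_hidden, w_output, x)
    # Error signal at each output unit: (teaching value - output) * slope.
    d_out = [(t - y) * y * (1.0 - y) for t, y in zip(target, output)]
    # Error signal at each hidden unit: the back-propagated output signals
    # are weighted by the connecting paths, added, and scaled by the slope.
    d_hid = [h * (1.0 - h) * sum(d * w_output[k][j]
                                 for k, d in enumerate(d_out))
             for j, h in enumerate(hidden)]
    for k, d in enumerate(d_out):
        for j, h in enumerate(hidden):
            w_output[k][j] += rate * d * h
    for j, d in enumerate(d_hid):
        for i, xi in enumerate(x):
            w_hidden[j][i] += rate * d * xi


w_hidden = [[0.3, -0.2], [-0.4, 0.1]]
w_output = [[0.2, -0.3], [0.1, 0.4]]
state, teaching = [1.0, 0.5], [0.9, 0.1]
before = squared_error(w_hidden, w_output, state, teaching)
for _ in range(200):
    delta_rule_step(w_hidden, w_output, state, teaching)
after = squared_error(w_hidden, w_output, state, teaching)
print("squared error before:", before, "after:", after)
```

Each step moves the weights a small distance down the gradient of the squared output error, so repeated steps reduce the error for this input and teaching vector.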

Jordan (1986; in press) reported simulation 
results in which activation values of the output 
units represented values of abstract phonetic 
features such as degree of voicing, nasality, or lip
rounding. The serial network was trained to
produce sequences of "phonemes", in which each
phoneme was defined as a particular bundle of 
context-independent target values for the 
features. These features were not used to generate 
articulatory movement patterns, however. After 
training, the network produced continuous 
trajectories over time for the featural values. 
These trajectories displayed several impressive
properties. First, the desired values were attained
at the required positions in a given sequence.
Second, the featural trajectories showed
anticipatory and carryover coarticulatory effects
for a feature that were contextually dependent
on the composition of the sequence as a whole.
This was due to the generalizing capacity of the
network, according to which similar network
states tend to produce similar outputs, and the
fact that the network states during production of a
given phoneme are similar to the states in which
nearby phonemes are learned. Finally, the
coarticulatory temporal "spreading" of a given
featural target value was not unlimited. Rather, it
was restricted due to the dropoff in state
similarity between a given phoneme and its
surrounding context.

Toward a hybrid dynamical model 

Jordan's (1986; in press) serial network has 
produced encouraging results for understanding 
the dynamics of intergestural timing in speech 
production. However, as already discussed, his 
speech simulations were defined with respect to a
standard list of phonetic features, and were not
related explicitly to actual articulatory movement
patterns. We plan to incorporate such a serial 
network into our speech modeling as a means of 
patterning the gestural activation intervals in the 
task-dynamic model summarized in Equation [5].
The resultant hybrid dynamical system (Figure 2) 
for articulatory control and coordination should 
provide a viable basis for further theoretical 
developments, guided by empirical findings in the
speech production literature. For example, it is
clear that the hybrid model must be able to
accommodate data on the consequences for
intergestural timing of mechanical perturbations
delivered to the articulators during speaking. 
Without feedback connections that directly or 
indirectly link the articulators to the intergestural 
level, a mechanical perturbation to a limb or 
speech articulator could not alter the timing 
structure of a given movement sequence. Recent 
data from human subjects on unimanual 
oscillatory movements (Kay, 1986; Kay, Saltzman, 
& Kelso, 1989) and speech sequences (Gracco & 
Abbs, in press) demonstrate that transient 
mechanical perturbations induce systematic shifts 
in the timing of subsequent movement elements. 
In related animal studies (see footnote [3]), 
transient muscle-nerve stimulation during
swimming movements of a turtle's hindlimb was
also shown to induce phase shifts in the locomotor
rhythm. Taken together, such data provide strong 
evidence that functional feedback pathways exist 
from the articulators to the intergestural level in 
the control of sequential activity. These pathways 
will be incorporated into our hybrid dynamical 
model (see the lighter pathway indicated in Figure
2). 

Intrinsic vs. extrinsic timing: Autonomous
vs. nonautonomous dynamics

As discussed earlier (see Gestural Activation
Coordinates section) there are two time spans
associated with every gesture in the current
model. The first is the gestural settling time, 
defined as the time required for an idealized, 
temporally isolated gesture to reach a certain 
criterion percentage of the distance from initial to 
target location in tract-variable coordinates. This 
time span is a function of the gesture's intrinsic 
set of dynamic parameters (e.g., damping,
stiffness). The second time-span, the gestural
activation interval, is defined according to a 
gesture's sequence-specific activation function. In 
the present model, gestural activation is specified 
as an explicit function of time in the gestural score 
for a given speech sequence. In the hybrid model
discussed in the previous section, these activation 
functions would emerge as implicit consequences 
of the serial dynamics intrinsic to a given 
sequence. 
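
The dependence of settling time on a gesture's intrinsic parameters can be made concrete with a short calculation. The sketch assumes a critically damped second-order system released from rest, unit mass, and a 90% criterion; none of these values is specified in this section, and the names remaining_fraction and settling_time are hypothetical.

```python
import math


def remaining_fraction(t, stiffness, mass=1.0):
    """Fraction of the initial-to-target distance still to go at time t,
    for a critically damped gesture released from rest (an assumption)."""
    omega = math.sqrt(stiffness / mass)
    return (1.0 + omega * t) * math.exp(-omega * t)


def settling_time(stiffness, criterion=0.9, mass=1.0):
    """Time at which the gesture has covered `criterion` of the distance."""
    lo, hi = 0.0, 1.0
    while remaining_fraction(hi, stiffness, mass) > 1.0 - criterion:
        hi *= 2.0
    for _ in range(60):   # bisection on a monotonically decreasing function
        mid = 0.5 * (lo + hi)
        if remaining_fraction(mid, stiffness, mass) > 1.0 - criterion:
            lo = mid
        else:
            hi = mid
    return hi


for k in (100.0, 400.0):
    print("stiffness", k, "settling time", round(settling_time(k), 4))
```

Under these assumptions, quadrupling the stiffness halves the settling time, whatever the length of the activation interval. The activation interval is a separate quantity, so a gesture that is active for less than its settling time would not reach the criterion distance.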

These considerations may serve to clarify 
certain aspects of a relatively longstanding and 
tenacious debate on the issue of intrinsic (e.g.,
Fowler, 1977, 1980) versus extrinsic (e.g.,
Lindblom, 1983; Lindblom et al., 1987) timing
control in speech production. In the framework of
the current model, intragestural temporal
patterns (e.g., settling times, interarticulator
asynchronies in peak velocities) can be
characterized unambiguously, at least for isolated
gestures, as intrinsic timing phenomena. These 
phenomena are emergent properties of the 
gesture-specific dynamics implicit in the 
coordinative structure spanning tract-variable and 
articulator coordinates (Figure 2, interarticulator 
level). In terms of intergestural timing, the issue 
is not so clear and depends on one's frame of
reference. If one focuses on the interarticulatory
level, then all activation inputs originate from the
"outside", and activation timing must be
considered extrinsic with reference to this level.
Activation timing is viewed as being controlled
externally according to whatever type of clock is
assumed to exist or be instantiated at the system's
intergestural level. However, if one considers both
levels within the same frame of reference then, by
definition, the timing of activation becomes 
intrinsic to the system as a whole. Whether or not 
this expansion of reference frame is useful in 
furthering our understanding of speech timing 
control depends, in part, on the nature of the clock 
posited at the intergestural level. This issue of
clock structure leads us to a somewhat more
technical consideration of the relationship
between intrinsic and extrinsic timing on the one
hand, and autonomous and nonautonomous 
dynamical systems on the other hand.

For speech production, one can posit that
intrinsic timing is identified with autonomous
dynamics, and extrinsic timing with
nonautonomous dynamics. In an autonomous
dynamical system, the terms in the corresponding
equation of motion are explicit functions only of
the system's "internal" state variables (i.e.,
positions and velocities). In contrast, a
nonautonomous system's equation of motion
contains terms that are explicit functions of
"external" clock-time, t, such as f(t) = cos(ωt)
(e.g., Haken, 1983; Thompson & Stewart, 1986).
However, the autonomous-nonautonomous
distinction is just as susceptible to one's selected
frame of reference as is the distinction between
intrinsic and extrinsic timing. The reason is that
any nonautonomous system of equations can be
transformed into an autonomous one by adding an 
equation(s) describing the dynamics of the 
(formerly) external clock-time variable. That is, 
the frame of reference for defining the overall 
system equation can be extended to include the 
dynamics of both the original nonautonomous
system as well as the formerly external clock. In
this new set of equations, a state of unidirectional 
coupling exists between system elements. The 
clock variable affects, but is unaffected by, the rest 
of the system variables. However, when such 
unidirectional coupling exists and the external 
clock meters out time in the standard, linear time- 
flow of everyday clocks and watches, we feel that 
its inclusion as an extra equation of motion adds 
little to our understanding of system behavior. In
these cases, the nonautonomous description
probably should be retained.
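
The transformation from a nonautonomous to an autonomous description can be shown with a small numerical sketch. The first-order system dx/dt = -x + cos(ωt), the Euler integration, the step size, and the function names are illustrative assumptions and are not taken from the model.

```python
import math


def simulate_nonautonomous(n_steps, dt=0.001, omega=2.0):
    """dx/dt = -x + cos(omega * t): clock time t appears explicitly."""
    x = 0.0
    for n in range(n_steps):
        t = n * dt
        x += dt * (-x + math.cos(omega * t))
    return x


def simulate_autonomous(n_steps, dt=0.001, omega=2.0):
    """Same system with the clock promoted to a state variable tau,
    governed by its own equation of motion d(tau)/dt = 1."""
    x, tau = 0.0, 0.0
    for _ in range(n_steps):
        dx = -x + math.cos(omega * tau)   # depends on the state (x, tau) only
        dtau = 1.0                        # tau drives x but is unaffected by x
        x += dt * dx
        tau += dt * dtau
    return x


print(simulate_nonautonomous(5000), simulate_autonomous(5000))
```

The two descriptions give the same trajectory up to floating-point rounding. The extra equation for tau is unidirectionally coupled to x and simply meters out linear clock time, which is why its inclusion adds little insight in such cases.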

In earlier versions of the present model (Kelso et
al., 1986a & 1986b; Saltzman, 1986; see also
Appendices 1 & 2) only temporally isolated
gestures or perfectly synchronous gesture pairs
were simulated. In these cases, the equations of
motion were truly autonomous, because the 
parameters at the interarticulatory level did not 
vary over the time course of the simulations. The 
parameters in the present model, however, are 
time-varying functions of the activation values 
specified at the intergestural level in the gestural 
score. Hence, the interarticulatory dynamics 
(Equation [5]) are currently nonautonomous. 
Because the gestural score specifies these 
activation values as explicit functions of standard
clock-time, little understanding is to be gained by
conceptualizing the system as an autonomous one
that incorporates the unidirectionally coupled
dynamics of standard clock-time and the
interarticulatory level. Thus, the present model
most sensibly should be considered as
nonautonomous. This would not be true, however, 
for the proposed hybrid model in which: a) clock- 
time dynamics are nontrivial and intrinsic to the 
utterance-specific serial dynamics of the 
intergestural level; and b) the intergestural and 
interarticulator dynamics mutually affect one 
another. In this case, we posit that much 
understanding is to be gained by incorporating the 
dynamics of both levels into a single set of 
bidirectionally coupled, autonomous system
equations. 

INTERGESTURAL COHESION 

As indicated earlier in this article (e.g., in 
Figure 1), speech production entails the 
interleaving through time of gestures defined 
across several different articulators and tract 
variables. In our current simulations, the timing 
of activation intervals for tract-variable gestures 
is controlled through the gestural score. 
Accordingly, gestures unfold independently over
time, producing simulated speech patterns much
like a player piano generates music. This rule- 
based description of behavior in the vocal tract 
makes no assumptions about coordination or 
functional linkages among the gestures 
themselves. However, we believe that such 
linkages exist, and that they reflect the existence
of dynamical coupling within certain gestural
subsets. Such coupling imbues these gestural
"bundles" with a temporal cohesion that endures
over relatively short (e.g., sublexical) time spans
during the course of an utterance. 

Support for the notion of intergestural cohesion 
has been provided by experiments that have 
focused on the structure of correlated variability
evidenced between tract-variable gestures in the
presence of externally delivered mechanical
perturbations. Correlated variability is one of the
oldest concepts in the study of natural variation,
and it is displayed in a system if "when slight
variations in any one part occur..., other parts
become modified" (Darwin, 1896, p. 128). For
example, in unperturbed speech it is well known
that a tight temporal relation exists between the
oral and laryngeal gestures for voiceless
obstruents (e.g., Löfqvist & Yoshioka, 1981a). For
instance, word-initial aspirated /p/ (in English) is
produced with a bilabial closing-opening gesture
and an accompanying glottal opening-closing
gesture whose peak coincides with stop release. In
a perturbation study on voiceless obstruents
(Munhall et al., 1986), laryngeal compensations
occurred when the lower lip was perturbed during
the production of the obstruent. Specifically, if the
lower lip was unexpectedly pulled downward just
prior to oral closure, the laryngeal abduction
gesture for devoicing was delayed. Shaiman and
Abbs (1987) have also reported data consistent
with this finding. Such covariation patterns
indicate a temporal cohesion among gestures,
suggesting to us the existence of higher order,
multigesture units in speech production.

How might intergestural cohesion be 
conceptualized? We hypothesize that such
temporal stability can be accounted for in terms of
dynamical coupling structure(s) that are defined
among gestural units. Such coupling has been
shown previously to induce stable intergestural
phase relations in a model of two coupled gestural
units whose serially repetitive (oscillatory)
dynamics have been explored both experimentally
and theoretically in the context of rhythmic
bimanual movements (e.g., Haken, Kelso, & Bunz,
1985; Kay, Kelso, Saltzman, & Schöner, 1987;
Scholz, 1986; Schöner, Haken, & Kelso, 1986).
This type of model also provides an elegant
account of certain changes in intergestural phase
relationships that occur with increases in
performance rate in the limbs and, by extension,
the speech articulators. In speech, such stability
and change have been examined for bilabial and
laryngeal sequences consisting of either the
repeated syllable /pi/ or /ip/ (Kelso, Munhall,
Tuller, & Saltzman, 1985; also discussed in Kelso
et al., 1986a, 1986b). When /pi/ is spoken
repetitively at a self-selected "comfortable" rate,
the glottal and bilabial component gestures for /p/
maintain a stable intergestural phase relationship
in which peak glottal opening lags peak oral
closing by an amount that results in typical (for
English) syllable-initial aspiration of the /p/. For
repetitive sequences of /ip/ spoken at a similarly
comfortable rate, peak glottal opening occurred
synchronously with peak oral closing as is typical
(for English) of unaspirated (or minimally
aspirated) syllable-final /p/. When /pi/ was
produced repetitively at a self-paced increasing
rate, intergestural phase remained relatively
stable at its comfort value. However, when /ip/
was scaled similarly in rate, its phase relation was
maintained at its comfort value until, at a critical
speaking rate, an abrupt shift occurred to the
comfort phase value and corresponding acoustic
pattern for the /pi/.
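
The kind of abrupt phase shift described above can be reproduced with the relative-phase equation from the cited bimanual literature (Haken, Kelso, & Bunz, 1985), dφ/dt = -a sin(φ) - 2b sin(2φ), which is not written out in the present article. The sketch below is an illustration under stated assumptions: the parameter values, initial phase, and Euler integration are arbitrary, the function name is hypothetical, and treating the phase near π as the less stable pattern and the phase near 0 as the pattern that survives at high rates is only an analogy to the /ip/ and /pi/ data, not a fitted model of them.

```python
import math


def relative_phase_after(b_over_a, phi0=math.pi - 0.1, a=1.0,
                         dt=0.01, duration=60.0):
    """Euler integration of d(phi)/dt = -a sin(phi) - 2b sin(2 phi),
    starting slightly away from phi = pi."""
    b = b_over_a * a
    phi = phi0
    for _ in range(round(duration / dt)):
        phi += dt * (-a * math.sin(phi) - 2.0 * b * math.sin(2.0 * phi))
    return phi


# In this equation a lower ratio b/a stands in for a higher movement rate.
for ratio in (1.0, 0.5, 0.3, 0.2, 0.1):
    print("b/a =", ratio, "final phase =",
          round(relative_phase_after(ratio), 3))
```

For ratios above 1/4 the phase returns to π after the small initial displacement; below 1/4 the pattern at π loses stability and the phase switches to 0. A continuous change in one parameter therefore produces a discontinuous change in the preferred phasing, which is the qualitative behavior reported for the rate-scaled sequences.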

In the context of the model of bimanual 
movement, the stable intergestural phase values 
at the comfort rate and the phase shift observed 
with rate scaling are reflections of the dynamical
behavior of nonlinearly coupled, higher-order
oscillatory modes. This use of modal dynamics 
parallels the identification of tract-variables with 
mode coordinates in the present model (see 
Appendix 1). Recall that the dynamics of these 
modal tract-variables serve to organize patterns of 
cooperativity among the articulators in a
gesture-specific manner (see the earlier section entitled
Model Articulator and Tract Variable
Coordinates). Such interarticulator coordination is
shaped according to a coupling structure among
the articulators that is "provided by" the
tract-variable modal dynamics. By extension, patterns
of intergestural coordination are shaped according
to inter-tract-variable coupling structures
"provided by" a set of even higher-order 
multigesture modes. Because tract-variables are 
defined as uncoupled in the present model 
(Equation [Al]), it seems clear that (some sort of) 
inter-tract-variable coupling must be introduced 
to simulate the multigesture functional units 
evident in the production of speech. 

Such multigesture units could play (at least) 
three roles in speech production. One possibility is
a hierarchical reduction of degrees of freedom in
the control of the speech articulators beyond that
provided by individual tract-variable dynamical
systems (e.g., Bernstein, 1967). A second, related
possibility is that multigesture functional units 
are particularly well suited to attaining 
articulatory goals that are relatively inaccessible 
to individual (or uncoupled) gestural units. For 
example, within single gestures the associated
synergistic articulators presumably cooperate in
achieving local constriction goals in tract-variable
space, and individual articulatory covariation is
shaped by these spatial constraints. Coordination
between tract-variable gestures might serve to 
achieve more global aerodynamic/acoustic effects 
in the vocal tract. Perhaps the most familiar of 
such between-tract-variable effects is that of voice 
onset time (Lisker & Abramson, 1964), in which 
subtle variations in the relative timing of 
laryngeal and oral gestures contribute to 
perceived contrasts in the voicing and aspiration 
characteristics of stop consonants.

The third possible role for multigesture units is 
that of phonological primitives. For example, in 
Browman and Goldstein's (1986) articulatory 
phonology, the phonological primitives are 
gestural constellations that are defined as 
"cohesive bundles" of tract-variable gestures. 
Intergestural cohesion is conceived in terms of the 
stability of relative phasing or spatiotemporal 
relations among gestures within a given 
constellation. In some cases, constellations 
correspond rather closely to traditional segmental 
descriptions: for example, a word-initial aspirated 
/p/ (in English) is represented as a bilabial 
closing-opening gesture and a glottal opening-closing 
gesture whose peak coincides with stop release; a 
word-initial /s/ (in English) is represented as a 
tongue tip raising-lowering and a glottal 
opening-closing gesture whose peak coincides with 
mid-frication. In other cases, however, it is clear that 
Browman and Goldstein offered a perspective that 
is both linguistically radical and empirically 
conservative. They rejected the traditional notion 
of segment and allowed as phonological primitives 
only those gestural constellations that can be 
observed directly from physical patterns of 
articulatory movements. Thus, in some instances, 
segmental and constellation representations 
diverge. For example, a word-initial /sp/ cluster 
(unaspirated in English) is represented as a 
constellation of two oral gestures (a tongue-tip 
and bilabial constriction-release sequence) and a 
single glottal gesture whose peak coincides with 
mid-frication. This representation is based on the 
experimental observation that for such clusters 
only one single-peaked glottal gesture occurs (e.g., 
Lisker, Abramson, Cooper, & Schvey, 1969), and 
thus captures the language-specific phonotactic 
constraint (for English) that there is no voicing 
contrast for stops following an initial /s/. The 
gestural constellation representation of /sp/ is 
consequently viewed as superior to a more 
traditional segmental approach which might 
predict two glottal gestures for this sequence. 
Our perspective on this issue is similar to that of 
Browman and Goldstein in that we focus on the 
gestural structure of speech. Like these authors, 
we assume that the underlying phonological 
primitives are context-independent cohesive 
"bundles" or constellations of gestures whose 
cohesiveness is indexed by stable patterns of 
intergestural phasing. However, we adopt a 
position that, in comparison with theirs, is both 
more conservative linguistically and more radical 
empirically. We assume that gestures cohere in 
bundles corresponding, roughly, to traditional 
segmental descriptions, and that these segmental 
units maintain their integrity in fluent speech. We 
view many context-dependent modifications of the 
gestural components of these units as emergent 
consequences of the serial dynamics of speech 
production. For example, we consider the single 
glottal gesture accompanying English word-initial 
/sp/ clusters to be a within-tract-variable blend of 
separate glottal gestures associated with the 
underlying /s/ and /p/ segments (see the following 
section for a detailed discussion of the observable 
kinematic "traces" left by such underlying 
gestures). 

INTERGESTURAL TIMING PATTERNS: 
EFFECTS OF SPEAKING RATE AND 
SEQUENCE COMPOSITION 

One of the working assumptions in this article is 
that gestural coproduction is an integral feature of 
speech production, and that many factors 
influence the degree of gestural overlap found for 
a given utterance. For example, a striking 
phenomenon accompanying increases in speaking 
rate or degree of casualness is that the gestures 
associated with temporally adjacent segments 
tend to "slide" into one another with a resultant 
increase in temporal overlap (e.g., Browman & 
Goldstein, in press; Hardcastle, 1985; Machetanz, 
1989; Nittrouer, Munhall, Kelso, Tuller, & Harris, 
1988). Such intergestural sliding occurs both 
between and within tract-variables, and is 
influenced by the composition of the segmental 
sequence as well as its rate or casualness of 
production. We turn now to some examples of the 
effects on intergestural sliding and blending of 
changes in speaking rate and sequence 
composition. 

Speaking rate 

Hardcastle (1985) showed with electropalatographic 
data that the tongue gestures associated 
with producing the (British English) consonant 
sequence /kl/ tend to slide into one another and 
increase their temporal overlap with 
experimentally manipulated increases in speaking 
rate. Many examples of interarticulator sliding 
were also identified some years ago by Stetson 
(1951). Stetson was interested in studying the 
changes in articulatory timing that accompany 
changes in speaking rate and rhythm. Particularly 
interesting are his scaling trials in which 
utterances were spoken at increasing rates. 
Figure 11 is one of Stetson's figures showing the 
time course of lip (L), tongue (T), and air pressure 
(A) for productions of "sap" at different speaking 
rates. As can be seen, the labial gesture for /p/ and 
the tongue gesture for /s/ are present throughout 
the scaling trial but their relative timing varies 
with increased speaking rate. By syllable 4 the 
tongue gesture for /s/ and the labial gesture for /p/ 
from the preceding syllable completely overlap, 
and syllable identity is altered from then on in the 
trial. 

Figure 11. Articulatory and aerodynamic records taken during productions of the syllable "sap" as speaking rate increases (from Stetson, 1951; used with publisher's permission). Stetson's original caption reads: Figure 62. Abutting Consonants; Continuant with Stop. Syllables: sap... L: Lip marker. Contact grows shorter and lighter as the rate increases and overlapping and coincidence occur. T: Tongue marker. Well marked doubling from syl. 5-6; thereafter the single releasing compound form ps. A: Air in mouth. Doubling forms, syl. 5-6. AO: Air outside. Varied in appearance because of the high pressure during the continuant s. Plateau of s becomes mere point as compound form appears, syl. 6-7. 


In terms of the present theoretical framework, 
these instances of relative sliding can be described 
as occurring between the activation intervals 
associated with tongue dorsum gestures (for the 
velar consonant /k/), tongue tip gestures (for the 
alveolars /s/ and /l/), and lip aperture gestures (for 
the bilabial /p/). During periods of temporal 
overlap, the gestures sharing articulators in 
common are blended. Because the gestures are 
defined in separate tract variables, they are 
observably distinct in articulatory movement 
records. Such patterns of change might be 
interpretable as the response of the hybrid 
dynamical model discussed earlier (see Hybrid 
Model section) to hypothetically simple changes in 
the values of a control parameter or parameter set 
presumably at the model's intergestural level (see 
Figure 2). One goal of future empirical and 
simulation research is to test this notion, and if 
possible, to identify this parameter set and the 
means by which it is scaled with speaking rate. 

Lofqvist & Yoshioka (1981b) have provided 
evidence for similar sliding and blending within 
tract-variables in an analysis of transillumination 
data on glottal devoicing gestures 
(abduction-adduction sequences) for a native speaker of 
Icelandic (a Germanic language closely related to 
Swedish, English, etc.). These investigators 
demonstrated intergestural temporal 
reorganization of glottal activity with spontaneous 
variation of speaking rate. For example, the 
cross-word-boundary sequence /t#k/ was accompanied 
by a two-peaked glottal gesture at a slow rate, but 
by a single-peaked gesture at fast rates. The 
interpretation of these data was that there were 
two underlying glottal gestures (one for /t/, one for 
/k/) at both the slow and fast rates. The visible 
result of only a single gesture at the fast rate 
appeared to be the simple consequence of blending 
and merging these two highly overlapping, 
underlying gestures defined within the same tract 
variable. These results have since been replicated 
for two speakers of North American English 
during experimentally controlled variations in the 
production rates of /s#t/ sequences (Munhall & 
Lofqvist, 1987). 
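
The qualitative logic of this interpretation can be illustrated numerically. The following is a minimal sketch, not the model's actual blending rule: it assumes that each underlying glottal gesture can be approximated by a bell-shaped (Gaussian) pulse and that the observable aggregate is their simple sum. The function names, pulse shape, and parameter values are all hypothetical.

```python
import math


def gestural_pulse(t, center, width):
    """Bell-shaped (Gaussian) pulse; the shape is an assumption of this sketch."""
    return math.exp(-0.5 * ((t - center) / width) ** 2)


def count_peaks(separation, width=1.0, dt=0.01):
    """Count local maxima in the sum of two equal pulses `separation` apart.

    Summation stands in for blending here; it is an illustrative
    simplification, not the blending rule of the task-dynamic model.
    """
    n = int(round((separation + 10.0 * width) / dt))
    times = [-5.0 * width + i * dt for i in range(n + 1)]
    total = [
        gestural_pulse(t, 0.0, width) + gestural_pulse(t, separation, width)
        for t in times
    ]
    # A sample is a peak if it is strictly larger than both neighbours.
    return sum(
        1
        for i in range(1, len(total) - 1)
        if total[i - 1] < total[i] > total[i + 1]
    )


slow_rate_peaks = count_peaks(separation=4.0)  # little temporal overlap
fast_rate_peaks = count_peaks(separation=1.0)  # extensive temporal overlap
```

Under these assumptions, two pulses that are well separated in time yield a two-peaked aggregate, whereas the same two pulses, slid together, yield a single peak even though both underlying gestures are still present.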

Sequence composition: Laryngeal 
gestures, oral-laryngeal dominance 

The rate scaling data described in the previous 
section for laryngeal gestures provide support for 
the hypothesis that the single-peaked gestures 
observed at fast speaking rates resulted from the 
sliding and blending of two underlying, 
sequentially adjacent gestures. In turn, this 
interpretation suggests a reasonable account of 
glottal behavior in the production of segmental 
sequences containing fricative-stop clusters. 

Glottal transillumination and speech acoustic 
data for word-final /s#/, /ks#/, and /ps#/ 
(unpublished data from Fowler, Munhall, 
Saltzman, & Hawkins, 1986a, 1986b) showed that 
the glottal opening-closing gesture for /s#/, in 
comparison to the other cases, was smaller in 
amplitude, shorter in duration, and peaked closer 
in time to the following voicing onset. These 
findings are consistent with the notion that a 
separate glottal gesture was associated with the 
cluster-initial stop, and that this gesture left its 
trace both spatially and temporally in blending 
with the following fricative gesture to produce a 
larger,^11 longer, and earlier-peaking single 
gestural aggregate. Other data from this 
experiment also indicate that the single-peaked 
glottal gestures observed in word-final clusters 
were the result of the blending of two overlapping 
underlying gestures. These data focus on the 
timing of peak glottal opening relative to the 
acoustic intervals (closure for /p/ or /k/, frication 
for /s/) associated with the production of /s#/, /ps#/, 
/ks#/, /sp#/, and /sk#/. For /s#/, the glottal peak 
occurred at mid-frication. However, for /ps#/ and 
/ks#/ it occurred during the first quarter of 
frication; for /sp#/ and /sk#/, it occurred during the 
third quarter of frication. These data indicate that 
an underlying glottal gesture was present for the 
/p/ or /k/ in these word-final clusters that blended 
with the gesture for the /s/ in a way that "pulled" 
or "perturbed" the peak of the gestural aggregate 
towards the /p/ or /k/ side of the cluster. The fact 
that the resultant glottal peak remained inside 
the frication interval for the /s/ may be ascribed, 
by hypothesis, to a relatively greater dominance 
over the timing of the glottal peak associated with 
/s/ compared to that associated with the voiceless 
stops /p/ or /k/. 

Dominance refers to the strength of 
hypothesized coupling between oral 
acoustic/articulatory events (e.g., frication and 
closure intervals) and glottal events (e.g., peak 
glottal opening). The dominance for a voiceless 
consonant's oral constriction over its glottal 
timing appears to be influenced by (at least) two 
factors.^12 The first is the manner class of the 
segment: frication intervals (at least for /s/) 
dominate glottal behavior more strongly than stop 
closure intervals. This factor was highlighted 
previously for word-initial clusters by Browman 
and Goldstein (1986; cf., Kingston's [in press] 
related use of oral-laryngeal "binding" and 
Goldstein's [in press] reply to Kingston). As just 
discussed, this factor also appears to influence 
glottal timing in word-final clusters. The second 
factor is the presence of a preceding word-initial 
boundary: word-initial consonants dominate 
glottal behavior more strongly than the same 
non-word-initial consonants. These two factors appear 
to have approximately additive effects, as 
illustrated by the following examples of 
fricative-stop sequences defined word- or syllable-initially 
and across word boundaries. In these cases, as 
was the case word-finally, the notion of dominance 
can be invoked to suggest that the single-peaked 
glottal gestures observed for such clusters are also 
blends of two underlying, overlapping gestures. 

Example 1. In English, Swedish, and Icelandic 
(e.g., Lofqvist, 1980; Lofqvist & Yoshioka, 1980, 
1981a, 1981b, 1984; Yoshioka, Lofqvist, & Hirose, 
1981), word-initial /s-(voiceless)stop/ clusters and 
/s/ are produced with a single-peaked glottal 
gesture that peaks at mid-frication. Word-initial 
/p/ is produced with a glottal gesture peaking at or 
slightly before closure release. Thus, the 
word-initial position of the /s/ in these clusters 
apparently bolsters the intrinsically high 
"segmental" dominance of the /s/, and 
eliminates the displacement of the glottal peak toward 
the /p/ that was described earlier for word-final 
clusters. 

Example 2. The rate scaling study for the 
cross-word-boundary /s#t/ sequence described earlier 
(Munhall & Lofqvist, 1987) showed two 
single-peaked glottal gestures for the slowest speaking 
rates, one double-peaked gesture for intermediate 
rates, and one single-peaked gesture at the fastest 
rate. At the slow and intermediate rates, the first 
peak occurred at mid-frication for the /s#/ and the 
second peak occurred at closure release for the 
/#t/. The single peak at the fastest rate occurred at 
the transition between frication offset and closure 
onset. These patterns indicate that when the two 
underlying glottal gestures merged into a 
single-peaked blend, the peak was located at a 
"compromise" position between the intrinsically 
stronger /s/ and the intrinsically weaker /t/ 
augmented by its word-initial status. 

Example 3. In Dutch (Yoshioka, Lofqvist, & 
Collier, 1982), word-initial /#p/ (voiceless, 
unaspirated) is produced with a glottal gesture 
peaking at midclosure, and the glottal peak for 
/#s/ and /#sp/ occurs at mid-frication. However, for 
/#ps/ (an allowable sequence in Dutch), the glottal 
peak occurs at the transition between closure 
offset and frication onset. Again, this suggests 
that when the inherently stronger /s/ is 
augmented by word-initial status in /#sp/, the 
glottal peak cannot be perturbed away from 
mid-frication by the following /p/. However, when the 
intrinsically weaker /p/ is word-initial, the glottal 
peak is pulled by the /p/ from mid-frication to the 
closure-frication boundary. 

Example 4. In Swedish (e.g., Lofqvist & 
Yoshioka, 1980), some word-final voiceless stops 
are aspirated (e.g., /k#/), and are produced with 
glottal gestures peaking at stop release. 
Word-initial /#s/ is produced with a glottal peak occurring 
at mid-frication (see Example 1). When the 
cross-word-boundary sequence /k#s/ is spoken at a 
"natural" rate, a single glottal gesture is produced 
with its peak occurring approximately at 
mid-frication. This is consistent with the high degree 
of glottal dominance expected for the intrinsically 
stronger /s/ in a word-initial position for the /k#s/ 
sequence. 

These examples provide support for the 
hypothesis that fricative-stop sequences can be 
associated with an underlying set of two 
temporally overlapping but slightly offset 
component glottal gestures blended into a single 
gestural aggregate. These examples focused on the 
observable kinematic "traces" evident in the 
timing relations between the aggregate glottal 
peak and the acoustic intervals of the sequence. 
Durational data also suggest that such single 
observable gestures result from a two-gesture 
blending process. For example, the glottal gesture 
for the cluster /#st/ is longer in duration than 
either of the gestures for /#s/ and /#t/ (McGarr & 
Lofqvist, 1988). A similar pattern has also been 
found by Cooper (1989) for word-internal, syllable- 
initial /#s/, /#p/, and /#sp/. 

SUMMARY 

We have outlined an account of speech 
production that removes much of the apparent 
conflict between observations of surface variability 
on the one hand, and the hypothesized existence 
of underlying, invariant gestural units on the 
other hand. In doing so, we have described 
progress made toward a dynamical model of 
speech patterning that can produce fluent 
gestural sequences and specify articulatory 
trajectories in some detail. Invariant units are 
posited in the form of relations between 
context-independent sets of gestural parameters and 
corresponding subsets of activation, tract-variable, 
and articulatory coordinates in the dynamical 
model. Each gesture's influence over the valving 
and shaping of the vocal tract waxes and wanes 
according to the activation strengths of the units. 
Variability emerges in the unfolding tract-variable 
and articulatory movements as a result of both the 
utterance-specific temporal interleaving of 
gestural activations, and the accompanying 
patterns of blending or coproduction. The relative 
timing of the gestures and the interarticulator 
cooperativity evidenced for a currently active 
gesture set are governed by two functionally 
distinct but interacting levels in the model: the 
intergestural and interarticulatory coordination 
levels, respectively. At present, the dynamics of 
the interarticulatory level are sufficiently well 
developed to offer promising accounts of 
movement patterns observed during unperturbed 
and mechanically perturbed speech sequences, 
and during periods of coproduction. We have only 
begun to explore the dynamics of the intergestural 
level. Yet even these preliminary considerations, 
grounded in developments in the dynamical 
systems literature, have already begun to shed 
light on several longstanding issues in speech 
science, namely, the issues of intrinsic versus 
extrinsic timing, the nature of intergestural 
cohesion, and the hypothesized existence of 
segmental units in the production of speech. We 
find these results encouraging, and look forward 
to further progress within this research 
framework. 
FOOTNOTES 

*Ecological Psychology, 1989, 1(4), 333-382. 

†Department of Communicative Disorders, Elborn College, 
University of Western Ontario, London, Ontario. 

^1 The term gesture is used, here and elsewhere in this article, to 
denote a member of a family of functionally equivalent 
articulatory movement patterns that are actively controlled with 
reference to a given speech-relevant goal (e.g., a bilabial 
closure). Thus, in our usage, gesture and movement have 
different meanings. Although gestures are composed of 
articulatory movements, not all movements can be interpreted 
as gestures or gestural components. 
^2 For example, we and others have asserted that several 
coordinate systems (e.g., articulatory and higher-order, 
goal-oriented coordinates), and mappings among these coordinate 
systems, must be involved implicitly in the production of 
speech. We have adopted one method of representing these 
mappings explicitly in the present model (i.e., using Jacobians 
and Jacobian pseudoinverses; see Appendix 2). We make no 
strong claim, however, as to the neural or behavioral reality of 
these specific methods. 
^3 An analogous functional partitioning has also been suggested 
in recent physiological studies by Lennard (1985) and Lennard 
and Hermanson (1985) on cyclic swimming motions of single 
hindlimbs in the turtle. In this work, the authors argued for a 
model of the locomotor neural circuit for turtle swimming that 
consists of two functionally distinct but interacting 
components. One component, analogous to the present 
interarticulator level, is a central intracycle pattern generator 
(CIPG) that organizes the patterning of muscular activity 
within each locomotor cycle. The second component, 
analogous to the present intergestural level, is an oscillatory 
central timing network (CTN) that is responsible for 
rhythmically activating or entraining the CIPG to produce an 
extended sequence of cycles (see also von Holst, 1973). A 
related distinction between "motor" and "clock" coordinative 
processes, respectively, has been proposed in the context of 
human manual rhythmic tasks consisting of either continuous 
oscillations at the wrist joints (e.g., Turvey, Rosenblum, 
Schmidt, & Kugler, 1986) or discrete finger tapping sequences 
(e.g., Wing, 1980; Wing & Kristofferson, 1973). 
^4 We do not mean to imply that the production of vocal tract 
constrictions and the shaping of articulatory trajectories are the 
primary goals of speech production. The functional role of 
speech gestures is to control air pressures and flows in the 
vocal tract so as to produce distinctive patterns of sound. In 
this article, we emphasize gestural form and stability as 
phonetic organizing principles for the sake of relative 
simplicity. Ultimately, the gestural approach must come to 
grips with the aerodynamic sound-production requirements of 
speech. 

^5 Since the preparation of this article, the task-dynamic model 
was extended to incorporate control of the tongue-tip (TTCL, 
TTCD), glottal (GLO), and velic (VEL) constrictions. These 
tract-variables and associated articulator sets are also shown in 
Figures 3 and 4. Results of simulations using these "new" 
gestures have been reported elsewhere in preliminary form 
(Saltzman, Goldstein, Browman, & Rubin, 1988a, 1988b). 

^6 Gestural activation pulses are similar functionally to Joos's 
(1948) theorized "innervation waves", whose ongoing values 
reflected the strength of vocal tract control associated with 
various phonological segments or segmental components. They 
are also analogous to the "phonetic influence functions" used 
by Mattingly (1981) in the domain of acoustic speech 
synthesis-by-rule. Finally, the activation pulses share with Fowler's 
(1983) notion of segmental "prominence" the property of being 
related to the "extent to which vocal tract activity is given over 
to the production of a particular segment" (p. 392). 

^7 Coarticulatory effects could also originate in two simpler ways. 
In the first case, "passive" coproduction could result from 
carryover effects associated with the cessation of active gestural 
control, due to the inertial sluggishness or time constants 
inherent in the articulatory subsystems (e.g., Coker, 1976; 
Henke, 1966). However, neither active nor passive 
coproduction need be involved in coarticulatory phenomena, at 
least in a theoretical sense. Even if a string of segments were 
produced as a temporally discrete (i.e., non-coproduced) 
sequence of target articulatory steady-states, coarticulatory 
effects on articulatory movement patterns would still result. In 
this second case, context-dependent differences in articulatory 
transitions to a given target would simply reflect 
corresponding differences in the interpolation of trajectories 
from the phonologically allowable set of immediately 
preceding targets. Both "sluggishness" and interpolation 
coarticulatory effects appear to be present in the production of 
actual speech. 

^8 In Jordan's (1986; in press) model, a given network can learn a 
single set of weights that will allow it to produce several 
different sequences. Each such sequence is produced (and 
learned) in the presence of a corresponding constant activation 
pattern in a set of plan units (see Figure 10). These units provide 
a second set of inputs to the network's hidden layer, in 
addition to the inputs provided by the state units. We propose, 
however, to use Jordan's model for cases in which different sets 
of weights are learned for different sequences. In such cases, 
the plan units are no longer required, and we ignore them in 
this article for purposes of simplicity. 
^9 To teach the network to perform a given sequence, Jordan 
(1986; in press) first initialized the network to zero, and then 
presented a sequence of teaching vectors (each corresponding 
to an element in the intended sequence), delivering one every 
fourth time step. At these times, errors were generated, 
back-propagated through the network, and the set of network 
weights were incrementally adjusted. During the three time 
steps between each teaching vector, the network was allowed 
to run free with no imposed teaching constraints. At the end of 
the teaching vector sequence the network was reinitialized to 
zero, and the entire weight-correction procedure was repeated 
until the sum of the squared output errors fell below a certain 
criterion. After training, the network's performance was tested 
starting with the state units set to zero. 

^10 One possibility is to construct explicitly a set of serial 
mini-networks that could produce sequentially cohesive, 
multigesture units. Then a higher order net could be trained to 
produce utterance-specific sequences of such units (e.g., Jordan, 
1985). It is also possible that multigesture units could arise 
spontaneously as emergent consequences of the learning-phase 
dynamics of connectionist, serial-dynamic networks that are 
trained to produce orchestrated patterns of the simpler gestural 
components (e.g., Grossberg, 1986; Miyata, 1987, 1988). This is 
clearly an important area to be explored in the development of 
our hybrid dynamical model of speech production (see the 
section entitled Toward a Hybrid Dynamical Model). 

^11 Transillumination signals are uncalibrated in terms of spatial 
measurement scale. Consequently, amplitude differences in 
glottal gestures are only suggested, not demonstrated, by 
corresponding differences in transillumination signal size. 
Temporal differences (e.g., durations, glottal peak timing) and 
the spatiotemporal shape (e.g., one vs. two peaks) of 
transillumination signals are reliable indices/reflections of 
gestural kinematics. 

^12 It is likely that rate and stress manipulations also have 
systematic effects on oral-glottal coordination. We make no 
claims regarding these potential effects in this article, however. 
APPENDIX 1 
Tract-variable dynamical system 

The tract-variable equations of motion are 
defined in matrix form as follows: 

$$\ddot{z} = M^{-1}(-B\dot{z} - K\Delta z), \tag{A1}$$

where $z$ = the $m \times 1$ vector of current 
tract-variable positions, with components listed in 
Figure 4; $\dot{z}$, $\ddot{z}$ = the first and second derivatives of 
$z$ with respect to time; $M$ = a $m \times m$ diagonal 
matrix of inertial coefficients. Each diagonal 
element, $m_{ii}$, is associated with the $i$th tract 
variable; $B$ = a $m \times m$ diagonal matrix of 
tract-variable damping coefficients; $K$ = a $m \times m$ 
diagonal matrix of tract-variable stiffness 
coefficients; and $\Delta z = z - z_0$, where $z_0$ = the target 
or rest position vector for the tract variables. 

By defining the $M$, $B$, and $K$ matrices as 
diagonal, the equations in (A1) are uncoupled. In 
this sense, the tract variables are assumed to 
represent independent modes of articulatory 
behavior that do not interact dynamically (see 
Coker, 1976, for a related use of articulatory 
modes). In current simulations, $M$ is assumed to 
be constant and equal to the identity matrix ($m_{ij}$ 
= 1.0 for $i = j$, otherwise $m_{ij}$ = 0.0), whereas the 
components of $B$, $K$, and $z_0$ vary during a 
simulated utterance according to the ongoing set 
of gestures being produced. For example, 
different vowel gestures are distinguished in 
part by corresponding differences in target 
positions for the associated set of tongue-dorsum 
point attractors. Similarly, vowel and consonant 
gestures are distinguished in part by 
corresponding differences in stiffness 
coefficients, with vowel gestures being slower 
(less stiff) than consonant gestures. Thus, 
Equation (A1) describes a linear system of 
tract-variable equations with time-varying 
coefficients, whose values are functions of the 
currently active gesture set (see the Parameter 
Tuning subsection of the text section Active 
Gestural Control: Tuning and Gating for a 
detailed account of this coefficient specification 
process). Note that simulations reported 
previously in Saltzman (1986) and Kelso et al. 
(1986a, 1986b) were restricted to either single 
"isolated" gestures, or synchronous pairs of 
gestures defined across different tract variables, 
e.g., single bilabial closures, or synchronous 
"vocalic" tongue-dorsum and "consonantal" 
bilabial gestures. In these instances, the 
coefficient matrices and vector parameters in 
Equation (A1) remained constant 
(time-invariant) throughout each such gesture set. 
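
Because the matrices in Equation (A1) are diagonal, each tract variable can be integrated on its own as a damped mass-spring (point-attractor) system. The following is a minimal numerical sketch of one such row, not the simulation code used for the model. The function name, the choice of critical damping, the integration scheme, and all parameter values are assumptions made for illustration only.

```python
import math


def simulate_tract_variable(z_init, z_target, stiffness, damping=None,
                            mass=1.0, dt=0.001, duration=1.0):
    """Integrate one uncoupled row of Equation (A1):

        z'' = (1 / m) * (-b * z' - k * (z - z0))

    Returns the list of positions sampled every `dt` seconds.
    """
    if damping is None:
        # Critical damping is an assumption of this sketch: b = 2 * sqrt(m * k).
        damping = 2.0 * math.sqrt(mass * stiffness)
    z, z_dot = z_init, 0.0
    trajectory = [z]
    steps = int(round(duration / dt))
    for _ in range(steps):
        z_ddot = (-damping * z_dot - stiffness * (z - z_target)) / mass
        # Semi-implicit Euler: update velocity first, then position.
        z_dot += z_ddot * dt
        z += z_dot * dt
        trajectory.append(z)
    return trajectory


# Hypothetical values: a stiffer "consonant-like" gesture and a less stiff
# "vowel-like" gesture moving from the same start toward the same target.
consonant = simulate_tract_variable(10.0, 0.0, stiffness=400.0)
vowel = simulate_tract_variable(10.0, 0.0, stiffness=100.0)
```

With these illustrative values, the stiffer system approaches its target sooner than the less stiff one, which is the sense in which vowel gestures are described above as slower than consonant gestures.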



APPENDIX 2 

Model articulator dynamical system; 
Orthogonal projection operator 

A dynamical system for controlling the model 
articulators is specified by expressing tract 
variables ($z$, $\dot{z}$, $\ddot{z}$) as functions of the 
corresponding model articulator variables ($\phi$, $\dot{\phi}$, 
$\ddot{\phi}$). The tract variables of Equation (A1) are 
transformed into model articulator variables 
using the following direct kinematic 
relationships: 

$$z = z(\phi) \tag{A2a}$$

$$\dot{z} = J(\phi)\dot{\phi} \tag{A2b}$$

$$\ddot{z} = J(\phi)\ddot{\phi} + \dot{J}(\phi, \dot{\phi})\dot{\phi} \tag{A2c}$$


where $\phi$ = the $n \times 1$ vector of current articulator 
positions, with components $\phi_j$ listed in Figure 4; 
$z(\phi)$ = the current $m \times 1$ tract-variable position 
vector expressed as a function of the current 
model articulator configuration. These 
functions are specific to the particular geometry 
assumed for the set of model articulators used to 
simulate speech gestures or produce speech 
acoustics via articulatory synthesis. $J(\phi)$ = the 
$m \times n$ Jacobian transformation matrix whose 
elements $J_{ij}$ are partial derivatives, $\partial z_i / \partial \phi_j$, 
evaluated at the current $\phi$. Thus, each row-$i$ of the 
Jacobian represents the set of changes in the $i$th 
tract variable resulting from unit changes in all 
the articulators; and $\dot{J}(\phi, \dot{\phi}) = (dJ(\phi)/dt)$, a $m \times n$ 
matrix resulting from differentiating the 
elements of $J(\phi)$ with respect to time. The 
elements of $\dot{J}$ are functions of both the current $\phi$ 
and $\dot{\phi}$. The elements of $J$ and $\dot{J}$ thus reflect the 
geometrical relationships among motions of the 
model articulators and motions of the 
corresponding tract variables. Using the direct 
kinematic relationships in Equation (A2), the 
equation of motion derived for the actively 
controlled model articulators is as follows: 



$$\ddot{\phi}_A = J^{*}\left(M^{-1}\left[-BJ\dot{\phi} - K\Delta z(\phi)\right]\right) - J^{*}\dot{J}\dot{\phi}, \tag{A3}$$






where §^ s: an articulatory acceleration vector 
representing the active driving influences on the 
model articulators; M, B, K, J, and J are the 
same matrices used in Equations (Al) and (A2); 
Az(0) = z(0) * Zo, where Zq = the same constant 
vector used in Equation (Al); It should be noted 
that because Az in Equations (Al) and (A3) is not 
assumed to be "small," a differential 
approximation dz = J(0)d0 is not justified and. 
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therefore, Equation (A2a) was used instead for 
the kinematic displacement transformation into 
model articulator variables; J'^ = a n x m 
weighted Jaeobian pseudoinverse (e.g., Benati, 
Gaglio, Morasso, Tagliasco, & Zaccaria, 1980; 
Klein & Huang, 1983; Whitney, 1972), 
J* = W-ljT(JV^-ljTrl where W is a n x n 
positive definite ailiciUatory weighting matrix 
whose elements are constant during a given 
isolated gesture, and superscript T denotes the 
vector or matrix transpose operation. The 
pseudoinverse is used because there are a greater 
number of model articulator variables than tract 
variables for this tosk More specifically, using 
J*^ provides a unique, optimal least squares 
solution for the redundant (e.g., Saltzman, 1979) 
differential transformation from tract variables 
to model articulator variables that is weighted 
according to the pattern of elements in the W- 
matrix. In current modeling, the W-matrix is 
defined to be of diagonal form, in which element 
Wjj is associated with articulator 0j A given set of 
articulator weights implements a corresponding 
pattern of constraints on the relative motions of 
the articulators during a given gesture. The 
motion of a given articulator is constrained in 
direct proportion to the magnitude of the 
^corresponding weighting element relative to the 
remaining weighting elements. Intuitively, 
then, the elements of W establish a gesture- 
specific pattern of relative "receptivities* among 
the articulators to the driving influences 
generated in the tract-variable state space. In the 
present model, J*^ has been generalized to a form 
whose elements are gated functions of the 
currently active gesture set (see the 
Transformation gating subsection of the text 
section Active gestural control: Tuning and 
gating for details). 
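
As a rough illustration of the weighting described above, the following Python sketch computes $J^{*} = W^{-1}J^{T}(JW^{-1}J^{T})^{-1}$ for the simplest case of a single tract variable (m = 1) driven by three articulators (n = 3) with a diagonal W. The function name, the Jacobian entries, and the weights are hypothetical values chosen for illustration; they are not taken from the model.

```python
def weighted_pseudoinverse_row(jacobian_row, weights):
    """Weighted pseudoinverse J* = W^-1 J^T (J W^-1 J^T)^-1 for a 1 x n Jacobian.

    jacobian_row: the n partial derivatives of one tract variable with
                  respect to the n model articulators.
    weights:      the n positive diagonal elements w_jj of W.
    Returns the n x 1 pseudoinverse as a flat list of length n.
    """
    if len(jacobian_row) != len(weights):
        raise ValueError("jacobian_row and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("W must be positive definite (all w_jj > 0)")
    # W^-1 J^T: with a diagonal W this is an elementwise division.
    winv_jt = [j / w for j, w in zip(jacobian_row, weights)]
    # J W^-1 J^T reduces to a scalar when there is a single tract variable.
    denom = sum(j * x for j, x in zip(jacobian_row, winv_jt))
    if denom == 0:
        raise ValueError("Jacobian row is zero; pseudoinverse undefined")
    return [x / denom for x in winv_jt]


# Hypothetical example: one tract variable driven by jaw, lower lip, upper lip.
J = [1.0, 1.0, 1.0]            # assumed unit sensitivities (illustrative only)
W_equal = [1.0, 1.0, 1.0]
W_heavy_jaw = [4.0, 1.0, 1.0]  # a larger weight constrains jaw motion more

dz = -0.6                      # desired change in the tract variable
for label, W in (("equal weights", W_equal), ("heavy jaw", W_heavy_jaw)):
    j_star = weighted_pseudoinverse_row(J, W)
    dphi = [x * dz for x in j_star]   # articulator displacements
    print(label, [round(v, 3) for v in dphi])
```

In both cases the articulator displacements sum to the requested tract-variable change; raising the jaw's weight simply shifts more of the movement onto the lips, which is the sense in which W sets the relative "receptivities" of the articulators.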

In Equation (A3), the first and second terms
inside the inner parentheses on the right hand
side represent the articulatory acceleration
components due to system damping ($\ddot{\phi}_d$) and
stiffness ($\ddot{\phi}_s$), respectively. The rightmost term
on the right hand side represents an acceleration
component vector ($\ddot{\phi}_{vp}$) that is nonlinearly
proportional to the squares and pairwise products
of current articulatory velocities (e.g., $\dot{\phi}_2\dot{\phi}_3$,
etc.; for further details, see Kelso et al., 1986a,
1986b; Saltzman, 1986; Saltzman & Kelso, 1987).

In early simulations of unperturbed discrete
speech gestures (e.g., bilabial closure) it was
found that, after a given gestural target (e.g.,
degree of lip compression) was attained and
maintained at a steady value, the articulators
continued to move with very small but non-
negligible (and undesirable) velocities. In
essence, the model added to the articulator
movements just those patterns that resulted in no
tract-variable (e.g., lip aperture) motion above
and beyond that demanded by the task. The
source of this residual motion was ascertained to
reside in the nonconservative nature of the
pseudoinverse ($J^{*}$; see Equation [A3]) of the
Jacobian transformation (J) used to relate tract-
variable motions and model articulator motions
(Klein & Huang, 1983). By nonconservative, we
mean that a closed path in tract-variable space
does not imply generally a closed path in model
articulator space.

These undesired extraneous model-articulator
motions were eliminated by including
supplementary dissipative forces proportional to
the articulatory velocities. Specifically, the
orthogonal projection operator ($I_n - J^{*}J$), where
$I_n$ is an n × n identity matrix (Baillieul,
Hollerbach, & Brockett, 1984; Klein & Huang,
1983), was used in the following augmented form
of Equation (A3):

$$\ddot{\phi}_A = J^{*}\left(M^{-1}\left[-BJ\dot{\phi} - K\,\Delta z(\phi)\right]\right) - J^{*}\dot{J}\dot{\phi} + \left(I_n - J^{*}J\right)\ddot{\phi}_d, \tag{A4}$$

where $\ddot{\phi}_d = -B_N\dot{\phi}$ represents an acceleration
damping vector, and $B_N$ is an n × n diagonal
matrix whose components, $b_{Njj}$, serve as constant
damping coefficients for the jth component of
$\dot{\phi}$. The subscript N denotes the fact that $B_N$ is the
same damping matrix as that used in the
articulatory neutral attractor (see the text section
on Nonactive Gestural Control, Equations [4] and
[5]).

Using Equation (A4), the model generates 
movements to tract-variable targets with no 
residual motions in either tract-variable or 
model-articulator coordinates. Significantly, 
the model works equally well for both the case of 
unperturbed gestures, and the case in which 
gestures are perturbed by simulated external 
mechanical forces (see the text section Gestural
primitives). In the present model, the identity
matrix ($I_n$) in Equation (A4) has been
generalized, like $J^{*}$, to a form whose elements
are gated functions of the currently active
gesture set (see the Transformation gating 
subsection of the text section Active gestural 
control: Tuning and gating). 
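
The role of the projection operator in Equation (A4) can be checked numerically. The sketch below again assumes a single tract variable, three articulators, and a diagonal W; it builds ($I_n - J^{*}J$) and confirms that a damping acceleration passed through it contributes nothing to the tract variable. All numbers, including the damping coefficient, are illustrative assumptions rather than model values, and the helper names are hypothetical.

```python
def null_space_projector(jacobian_row, weights):
    """Return the n x n matrix (I_n - J* J) for a 1 x n Jacobian and diagonal W."""
    winv_jt = [j / w for j, w in zip(jacobian_row, weights)]
    denom = sum(j * x for j, x in zip(jacobian_row, winv_jt))
    j_star = [x / denom for x in winv_jt]          # weighted pseudoinverse J*
    n = len(jacobian_row)
    return [[(1.0 if r == c else 0.0) - j_star[r] * jacobian_row[c]
             for c in range(n)]
            for r in range(n)]


def mat_vec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


J = [1.0, 1.0, 1.0]              # assumed Jacobian row (illustrative only)
W = [4.0, 1.0, 1.0]              # assumed articulator weights
P = null_space_projector(J, W)

phi_dot = [0.3, -0.1, 0.2]       # residual articulator velocities (made up)
b_N = 2.0                        # assumed equal damping coefficient
damping_acc = [-b_N * v for v in phi_dot]

projected = mat_vec(P, damping_acc)
tract_effect = sum(j * a for j, a in zip(J, projected))
print(abs(tract_effect) < 1e-12)
```

Because $J(I_n - J^{*}J) = 0$, the projected damping acts only on articulator motion that leaves the tract variable unchanged, which is why it can remove the residual drift without disturbing the task.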

The damping coefficients of $B_N$ are typically
assigned equal values for all articulators. This
results in synchronous movements (varying in
amplitudes) for the tract variables and
articulators involved in isolated gestures.
Interesting patterns emerge, however, if the
coefficients are assumed to be unequal for the
various articulators (Saltzman et al., 1987). For
example, the relatively sluggish rotations of the
jaw or horizontal motions of the lips may be
characterized by larger time constants than the
lips' relatively brisk vertical motions.
When these asymmetries are implemented in $B_N$,
interarticulator asynchronies within single
speech gestures are generated by the model that
mirror, partially, some patterns reported in the
literature. For example, Gracco and Abbs (1986)
showed that, during bilabial closing gestures for
the first /p/ in /sæpæpl/, the raising onsets and
peak velocities of the component articulatory
movements occur in the order: upper lip, lower
lip, and jaw. The peak velocities conform to this
order more closely than the raising onsets. In
current simulations of isolated bilabial gestures,
the asynchronous pattern of the peak velocities
(but not the movement onsets) emerges naturally
when the elements of $B_N$ are unequal.
Interestingly, the tract-variable trajectories are
identical to those generated when the elements of $B_N$
are equal. Additional simulations have revealed
that patterns of closing onsets may become
asynchronous, however, depending on several
factors, e.g., the direction and magnitude of the
jaw's velocity prior to the onset of the closing
gesture.

APPENDIX 3 

Competitive network equations for 
parameter tuning 

The postblending activation strengths ($p_{Tik}$
and $p_{Pikj}$) defined in text Equation (2) are given
by the steady-state solutions to a set of
feedforward, competitive-interaction-network
dynamical equations (e.g., Grossberg, 1986) for
the preblending activation strengths ($a_{ik}$) in the
present model. These equations are expressed as
follows:



[Equation (A5a)]

[Equation (A5b)]



where $B_p$ and [illegible] denote the maximum values
allowed for the pre-blending and post-blending
activation strengths, respectively. In current
modeling, $B_p$ and [illegible] are defined to equal 1.0; and
[illegible] and [illegible] are the lateral inhibition and
"gatekeeper" coefficients, respectively, defined
in text Equation (2).

The solutions to Equations (A5a) and (A5b) are
obtained by setting their left-hand sides to zero,
and solving for $p_{Tik}$ and $p_{Pikj}$, respectively.
These solutions are expressed in Equations (2a)
and (2b). The dynamics of Equation (A5) are
assumed to be "fast" relative to the dynamics of
the interarticulator coordination level
(Equations [A3] and [A4]). Consequently, 
incorporating the solutions of Equation (A5) 
directly into Equation (1) is viewed as a justified 
computational convenience in the present model 
(see also Grossberg & Mingolla, 1986, for a 
similar computational simplification).

1 A GESTURAL PHONOLOGY

Over the past few years, we have been 
investigating a particular hypothesis about the 
nature of the basic 'atoms' out of which 
phonological structures are formed. The atoms are
assumed to be primitive actions of the vocal tract
articulators that we call 'gestures.' Informally, a
gesture is identified with the formation (and 
release) of a characteristic constriction within one 
of the relatively independent articulatory 
subsystems of the vocal tract (i.e., oral, laryngeal, 
velic). Within the oral subsystem, constrictions 
can be formed by the action of one of three 
relatively independent sets of articulators: the 
lips, the tongue tip/blade and the tongue body. As 
actions, gestures have some intrinsic time 
associated with them: they are characterizations
of movements through space and over time (see 
Fowler et al., 1980). 

Within the view we are developing, phonological 
structures are stable 'constellations' (or 
'molecules', to avoid mixing metaphors) assembled 
out of these gestural atoms. In this paper, we 
examine some of the evidence for, and some of the 
consequences of, the assumption that gestures are 
the basic atoms of phonological structures. First, 
we attempt to establish that gestures are pre-
linguistic discrete units of action that are inherent
in the maturation of a developing child and that
therefore can be harnessed as elements of a
phonological system in the course of development
(§ 1.1). We then give a more detailed, formal
characterization of gestures as phonological units
within the context of a computational model
(§ 1.2), and show that a number of phonological
regularities can be captured by representing
constellations of gestures (each having inherent
duration) using gestural scores (§ 1.3). Finally, we
show how the proposed gestural structures relate
to proposals of feature geometry (§§ 2 - 3).

This paper has benefited from criticisms by Cathi Best, Alice
Faber, Elliot Saltzman, Michael Studdert-Kennedy, Eric
Vatikiotis-Bateson, Doug Whalen and two anonymous
reviewers. Our thanks to Mark Tiede for help with manuscript
preparation, and Zefang Wang for help with the graphics. This
work was supported by NSF grant BNS-8620709 and NIH
grants HD-01994 and NS-13617 to Haskins Laboratories.

1.1 Gestures as pre-linguistic primitives

Gestures are units of action that can be 
identified by observing the coordinated 
movements of vocal tract articulators. That is, 
repeated observations of the production of a given 
utterance will reveal a characteristic pattern of
constrictions being formed and released. The fact
that these patterns of (discrete) gestures are
similar in structure to the nonlinear phonological
representations being currently postulated (e.g.,
Clements, 1985; Hayes, 1986; Sagey, 1986),
together with some of the evidence presented in
Browman and Goldstein (1986, in press), leads us
to make the strong hypothesis that gestures
themselves constitute basic phonological units.
This hypothesis has the attractive feature that the
basic units of phonology can be identified directly
with cohesive patterns of movement within the
vocal tract. Thus, the phonological system is built
out of inherently discrete units of action. This
state of affairs would be particularly useful for a
child learning to speak. If we assume that discrete
gestures (like those that will eventually function
as phonological units) emerge in the child's
behavioral repertoire in advance of any
specifically linguistic development, then it is
possible to view phonological development as
harnessing these action units to be the basic units
of phonological structures.

The idea that pre-linguistic gestures are
employed in the service of producing early words
has been proposed and supported by a number of
writers, for example, Fry (1966, in Vihman in
press), Locke (1983), Studdert-Kennedy (1987)
and Vihman (in press), where what we identify as
'gestures' are referred to as 'articulatory routines'
or the like. The view we are proposing extends
this approach by hypothesizing that these pre-
linguistic gestures actually become the units of
contrast. Additional phonological developments
involve differentiating and tuning the gestures,
and developing patterns of intergestural
coordination that correspond to larger
phonological structures.

The evidence that gestures are pre-linguistic 
units of action can be seen in the babbling 
behavior of young infants. The descriptions of 
infant babbling (ages 6-12 months) suggest a 
predominance of what are transcribed as simple
CV syllables (Locke, 1986; Oller & Eilers, 1982).
The 'consonantal' part of these productions can be
analyzed as simple, gross, constriction maneuvers
of the independent vocal tract subsystems and
(within the oral subsystem) the separate oral
articulator sets. For example, based on frequency
counts obtained from a number of studies, Locke
(1983) finds that the consonants in (1) constitute
the 'core' babbling inventory: these 12 'consonants'
account for about 95% of the babbles of English
babies. Similar frequencies obtain in other
language environments.

(1) h  b d g  p t k  m n  j w  s

These transcriptions are not meant to be either 
systematic phonological representations (the child 
doesn't have a phonology yet), or narrow phonetic 
transcriptions (the child cannot be producing the 
detailed units of its 'target' language, because, as 
noted below, there do not seem to be systematic 
differences in the babbles produced by infants in 
different language environments). Others have 
noted the problems inherent in using a
transcription that assumes a system of units and
relations to describe a behavior that lacks such a
system (e.g., Kent & Murray, 1982; Koopmans-van
Beinum & van der Stelt, 1986; Oller, 1986;
Studdert-Kennedy, 1987). As Studdert-Kennedy 
(1987) argues, it seems likely that these 
transcriptions reflect the production by the infant 
of simple vocal constriction gestures, of the kind 
that evolve into mature phonological structures 
(which is why adults can transcribe them using 
their phonological categories). Thus, /h/ can be
interpreted as a laryngeal widening gesture and
/b d g/ as 'gross' constriction gestures of the three
independent oral articulator sets (lips, tongue tip,
and tongue body); /p t k/ combine the oral
constriction gestures with the laryngeal
maneuver, and /m n/ combine oral constrictions
with velic lowering. These combinations do not
necessarily indicate an ability on the part of the
infant to coordinate the gestures. Rather, any
accidental temporal coincidence of two such
gestures would be perceived by the listener as the
segments in question.

The analysis outlined above suggests that 
babbling involves the emergence, in the infant, of 
simple constriction gestures of independent parts 
of the vocal tract. As argued by Locke (1986), the
pattern of emergence of these actions can be
viewed as a function of anatomical and
neurophysiological developments, rather than the 
beginning of language acquisition, per se. This can 
be seen, first of all, in the fact that the babbling 
inventory and its developmental sequence have 
not been shown to vary as a function of the 
particular language environment in which the 
child finds itself (although individual infants may 
vary considerably from one another in the relative 
frequencies of particular gestures — Studdert- 
Kennedy 1987; Vihman, in press). In fact, in the 
large number of studies reviewed by Locke (1983), 
there appear to be no detectable differences 
(either instrumentally or perceptually) in the 
'consonantal' babbling of infants reared in 
different language environments. (More recent 
studies have found some language environment 
effect on the overall long term spectrum of vocalic
utterances [de Boysson-Bardies et al., 1986] and
on prosody [de Boysson-Bardies et al., 1984].
Other subtle effects may be uncovered with 
improvement of analytic techniques.) 

Secondly, Locke (1983) notes that the 
developmental changes in frequency of particular 
babbled consonants can likely be explained by
anatomical developments. Most of the consonants
produced by very young infants (less than six
months) involve tongue body constrictions, usually
transcribed as velars. Some time shortly after the
beginning of repetitive canonical babbling (usually
in the seventh month), tongue tip and lip
constrictions begin to outnumber tongue body
constrictions, with tongue tip constrictions
eventually dominating. Even deaf infants show a
progression that is qualitatively similar, at least
in early stages, although their babbling can be
distinguished from that of hearing infants on a
number of acoustic measures (Oller & Eilers,
1988). Locke suggests an explanation in terms of
vocal tract maturation. At birth, the infant's
larynx is high, and the tongue virtually fills the
oral cavity (Lieberman, 1984). This would account
for the early dominance of tongue body
constrictions. After the larynx drops, tongue tip
and lip constrictions (without simultaneous
tongue constrictions) are more readily formed. In
particular, the closing action of the mandible will
then contribute to constrictions at the front of the
mouth.

Finally, Locke (1986) notes that the timing of 
the development of repetitive 'syllabic' babbling 
coincides with the emergence of repetitive motor 
behaviors generally. He cites Thelen's (1981) 
observation of 47 different rhythmic activities that 
have their peak frequency at 6-7 months. Locke 
concludes (1986, p. 145) that 'it thus appears that 
the syllabic patterning of babble— like the 
phonetic patterning of its segments— is 
determined mostly by nonlinguistic developments 
of vocal tract anatomy and neurophysiology.' 

The pre-linguistic vocal gestures become 
linguistically significant when the child begins to 
produce its first few words. The child seems to 
notice the similarity of its babbled patterns to the 
speech s/he hears (Locke 1986), and begins to 
produce 'words' (with apparent referential 
meaning) using the available set of vocal gestures. 
It is possible to establish that there is a definite 
relationship between the (nonlinguistic) gestures 
of babbling and the gestures employed in early 
words by examining individual differences among 
children. Vihman et al. (1985) and Vihman (in 
press) find that the particular consonants that 
were produced with high frequency in the 
babbling of a given child also appear with high 
frequency in that child's early word productions, 
Thus, the child is recruiting its well-practiced
action units for a new task. In fact, in some early
cases (e.g., 'baby' words like mama, etc.),
'recruiting' is too active a notion. Rather, parents
are helping the child establish a referential 
function with sequences that already exist as part 
of the babbling repertoire (Locke, 1986). 

Once the child begins producing words (complex 
units that have to be distinguished one from 
another) using the available gestures as building 
blocks, phonology has begun to form. If we 
compare the child's early productions (using the 
small set of pre-linguistic gestures) to the gestural 
structure of the adult forms, it is clear that there 
are (at least) two important developments that are 
required to get from one to the other: (1) 
differentiation and tuning of individual gestures 
and (2) coordination of the individual gestures 
belonging to a given word. Let us examine these in 
turn. 

Differentiation and tuning. While the 
repertoire of gestures inherent in the consonants 
of (1) above employs all of the relatively 
independent articulator sets, the babbled gestures
involve just a single (presumably gross)
movement. For example, some kind of closure is
involved for oral constriction gestures. In general,
however, languages employ gestures produced 
with a given articulator set that contrast in the 
degree of constriction. That is, not only are closure 
gestures produced but also fricative and wider 
(approximant) gestures. In addition, the exact 
location of the constriction formed by a given 
articulator set may contrast, e.g., in the gestures 
for /θ/, /s/ and /ʃ/. Thus, a single pre-linguistic
constriction gesture must eventually differentiate 
into a variety of potentially contrastive gestures, 
tuned with different values of constriction location 
and degree. Although the location and degree of 
the constriction formed by a given articulator set 
are, in principle, physical continua, the 
differentiated gestures can be categorically 
distinct. The partitioning of these continua into 
discrete categories is likely aided by quantal (i.e.,
nonlinear) articulatory-auditory relations of the
kind proposed by Stevens (1972, 1989). In
addition, Lindblom (1986) has shown how the
pressures to keep contrasting words perceptually
distinct can lead to discrete clustering along some
articulatory/acoustic continua. Even so, the process
of differentiation may be lengthy. Nittrouer,
Studdert-Kennedy, and McGowan (1989) present
data on fricative production in American English
children. They find that differentiation between /s/
and /ʃ/ is increasing in children from ages three to
seven, and hasn't yet reached the level shown by
adults. In addition, tuning may occur even where
differentiation is unnecessary. That is, even if a
language has only a single tongue tip closure
gesture, its constriction location may be tuned to a
particular language-specific value. For example,
English stops have an alveolar constriction
location, while French stops are more typically
dental.

Coordination. The various gestures that 
constitute the atoms of a given word must be
organized appropriately. There is some evidence 
that a child can know what all the relevant 
gestures are for some particular word, and can 
produce them all, without either knowing or being 
able to produce the appropriate organization. 
Studdert-Kennedy (1987) presents an example of
this kind from Ferguson and Farwell (1975). They
list ten attempts by a 15-month old girl to say the
word pen in a half-hour session, as shown in (2):

(2) [mãə, ṽ, dɛᵈⁿ, hɪn, ᵐbõ, pʰɪn, tʰn̩tʰn̩tʰn̩, bah,
dʰauⁿ, buã]

While these attempts appear radically different, 
they can be analyzed, for the most part, as the set
of gestures that constitute pen misarranged in
various ways: glottal opening, bilabial closure,
tongue body lowering, alveolar closure and velum
lowering. Eventually, the 'right' organization is hit
upon by the child. The search is presumably aided
by the fact that the coordinated structure
embodied in the target language is one of a
relatively small number of dynamically stable
patterns (see also Boucher, 1988). The formation
of such stable patterns may ultimately be
illuminated by research into stable modes in
coordinated human action in general (e.g., Haken
et al., 1985; Schmidt et al., 1987) being conducted
within the broad context of the non-linear
dynamics relevant to problems of pattern
formation in physics and biology (e.g., Glass &
Mackey, 1988; Thompson & Stewart, 1976). In
addition, aspects of coordination may emerge as 
the result of keeping a growing number of words 
perceptually distinct using limited articulatory 
resources (Lindblom et al., 1983). 

If phonological structures are assumed to be 
organized patterns of gestural units, a distinct 
methodological bonus obtains: the vocal behavior 
of infants, even pre-linguistic behavior, can be 
described using the same primitives (discrete 
units of vocal action) that are used to describe the 
fully elaborated phonological system of adults. 
This allows the growth of phonological form to be 
precisely monitored, by observing the development 
of the primitive gestural structures of infants into 
the elaborated structures of adults (Best & 
Wilkenfeld, 1988 provide an example of this). In 
addition, some of the thorny problems associated 
with the transcription of babbling can be obviated. 
Discrete units of action are present in the infant, 
and can be so represented, even if adult-like 
phonological structures have not yet developed. A 
similar advantage applies to describing various 
kinds of 'disordered' speech (e.g., Kent, 1983;
Marshall et al., 1988), which may lack the
organization shown in 'normal' adult phonology,
making conventional phonological/phonetic
transcriptions inappropriate, but which may,
nevertheless, be composed of gestural primitives. 
Of course, all this assumes that it is possible to 
give an account of adult phonology using gestures 
as the basic units — it is to that account that we 
now turn. 

1.2 The nature of phonological gestures

In conjunction with our colleagues Elliot
Saltzman and Philip Rubin at Haskins
Laboratories, we are developing a computational
model that produces speech beginning with a
representation of phonological structures in terms
of gestures (Browman et al., 1984; Browman et al.,
1986; Browman & Goldstein, 1987; Saltzman et
al., 1987), where a gesture is an abstract
characterization of coordinated task-directed
movements of articulators within the vocal tract.
Each gesture is precisely defined in terms of the
parameters of a set of equations for a 'task-
dynamic' model (Saltzman, 1986; Saltzman &
Kelso, 1987). When the control regime for a given
gesture is active, the equations regulate the
coordination of the model's articulators in such a
way that the gestural 'task' (the formation of a
specified constriction) is reached as the articulator
motions unfold over time. Acoustic output is
obtained from these articulator motions by means
of an articulatory synthesizer (Rubin et al., 1981).
The gestures for a given utterance are themselves
organized into a larger coordinated structure, or
constellation, that is represented in a gestural
score (discussed in § 1.3). The score specifies the
sets of values of the dynamic parameters for each
gesture, and the temporal intervals during which
each gesture is active. While we use analyses of
articulatory movement data to determine the
parameter values for the gestures and gestural
scores, there is nevertheless a striking
convergence between the structures we derive
through these analyses and phonological
structures currently being proposed in other
frameworks (e.g., Anderson & Ewen, 1987;
Clements, 1985; Ewen, 1982; Lass, 1984;
McCarthy, 1989; Plotkin, 1976; Sagey, 1986).

Within task dynamics, the goal for a given
gesture is specified in terms of independent task
dimensions, called vocal tract variables. Each
tract variable is associated with the specific sets of
articulators whose movements determine the
value of that variable. For example, one such tract
variable is Lip Aperture (LA), corresponding to
the vertical distance between the two lips. Three
articulators can contribute to changing LA: the
jaw, vertical displacement of the lower lip with
respect to the jaw, and vertical displacement of
the upper lip. The current set of tract variables in
the computational model, and their associated
articulators, can be seen in Figure 1. Within the
task dynamic model, the control regime for a given
gesture coordinates the ongoing movements of
these articulators in a flexible, but task-specific
manner, according to the demands of other
concurrently active gestures. The motion
associated with each of a gesture's tract variables
is specified in terms of an equation for a second-
order dynamical system. The equilibrium
position parameter of the equation (x0) specifies
the tract variable target that will be achieved, and
the stiffness parameter k specifies (roughly) the
time required to get to target. These parameters
are tuned differently for different gestures. In
addition, their values can be modified by stress.

Gestures are currently specified in terms of one 
or two tract variables. Velic gestures involve a 
single tract variable of aperture size, as do glottal 
gestures. Oral gestures involve pairs of tract 
variables that specify the constriction degree (LA,
TTCD, and TBCD) and constriction location (LP,
TTCL, and TBCL). For simplicity, we will refer to 
the sets of articulators involved in oral gestures 
using the name of the end-effector, that is, the 
name of the single articulator at the end of the 
chain of articulators forming the particular 
constriction: the LIPS for LA and LP, the tongue 
tip (TT) for gestures involving TTCD and TTCL, 
and the tongue body (TB) for gestures involving 
TBCD and TBCL. As noted above, each tract
variable is modelled using a separate dynamical
equation; however, at present the paired tract
variables use identical stiffness and are activated
and de-activated simultaneously. The damping
parameter b for oral gestures is always set for
critical damping: the gestures approach their
targets, but do not overshoot them, or 'ring.' Thus, a
given oral gesture is specified by the values of 
three parameters: target values for each of a pair 
of tract variables, and a stiffness value (used for 
both equations). 
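
The behavior of a single tract variable under these settings can be sketched numerically. The Python fragment below assumes a unit-mass system obeying x'' + b x' + k (x - x0) = 0 with b = 2*sqrt(k) (critical damping) and evaluates its closed-form solution. The function name, the unit mass, the starting value, and the two stiffness values are illustrative assumptions and not parameters of the model.

```python
import math


def tract_variable_position(t, x_start, x0, k, v_start=0.0):
    """Position at time t of a critically damped, unit-mass second-order system.

    Closed-form solution of x'' + b*x' + k*(x - x0) = 0 with b = 2*sqrt(k).
    x_start and v_start are the (assumed) position and velocity at t = 0.
    """
    if k <= 0:
        raise ValueError("stiffness k must be positive")
    omega = math.sqrt(k)               # natural frequency for unit mass
    offset = x_start - x0              # initial distance from the target
    slope = v_start + omega * offset   # coefficient of the t * exp(-omega*t) term
    return x0 + (offset + slope * t) * math.exp(-omega * t)


# Hypothetical closing gesture: start at 10, target x0 = 0 (arbitrary units).
for k in (100.0, 400.0):               # illustrative stiffness values only
    samples = [tract_variable_position(0.05 * i, 10.0, 0.0, k) for i in range(9)]
    print("k =", k, [round(x, 2) for x in samples])
```

Starting from rest, each trajectory moves monotonically toward x0 without crossing it, and the stiffer setting arrives sooner, which is the sense in which k "roughly" sets the time to target.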

This set of tract variables is not yet complete, of 
course. Other oral tract variables that need to be 
implemented include an independent tongue root 
variable (Ladefoged & Halle, 1988), and (as 
discussed in § 2.1) variables for controlling the
shape of TT and TB constrictions as seen in the
third dimension. Additional laryngeal variables
are required to allow for pitch control and for
vertical movement of the larynx, required, for 
example, for ejectives and implosives. 



        tract variable                    articulators involved

LP      lip protrusion                    upper & lower lips, jaw
LA      lip aperture                      upper & lower lips, jaw
TTCL    tongue tip constrict location     tongue tip, body, jaw
TTCD    tongue tip constrict degree       tongue tip, body, jaw
TBCL    tongue body constrict location    tongue body, jaw
TBCD    tongue body constrict degree      tongue body, jaw
VEL     velic aperture                    velum
GLO     glottal aperture                  glottis

Figure 1. Tract variables and contributing articulators of computational model.

The representations of gestures employing
distinct sets of tract variables are categorically 
distinct within the system outlined here. That is, 
they are defined using different variables that 
correspond to different sets of articulators. They 
provide, therefore, an inherent basis for contrast 
among gestures (Browman & Goldstein, in press). 
However, for contrasting gestures that employ the 
same tract variables, the difference between the 
gestures is in the tuned values of the continuous 
dynamic parameters (for oral gestures: 
constriction degree, location and stiffness). That 
is, unlike the articulator sets being used, the 
dynamic parameters do not inherently define 
categorically distinct classes. Nonetheless, we 
assume that there are stable ranges of parameter 
values that tend to contrast with one another 
repeatedly in languages (Ladefoged & Maddieson, 
1986; Vatikiotis-Bateson, 1988). The discrete 
values might be derived using a combination of 
articulatory and auditory constraints applied 
across the entire lexicon of a language, as 
proposed by Lindblom et al. (1983). In addition, 
part of the basis for the different ranges might 
reside in the nonlinear relation between the 
parameter values and their acoustic consequences, 
as in Stevens' quantal theory (Stevens 1972,
1989).

In order to represent the contrastive ranges
of gestural parameter values in a discrete fashion,
we employ a set of gestural descriptors. These 
descriptors serve as pointers to the particular 
articulator set involved in a given gesture, and to 
the numerical values of the dynamical parameters 
characterizing the gestures. In addition, they can 
act as classificatory and distinctive features for 
the purposes of lexical and phonological structure. 
Every gesture can be specified by a distinct 
descriptor structure. This functional organization 
can be formally represented as in (3), which 
relates the parameters of the dynamical equations 
to the symbolic descriptors. Contrasting gestures
will differ in at least one of these descriptors.

(3) Gesture = articulator set (constriction
degree, constriction location, constriction
shape, stiffness)

Constriction Degree is always present, and
refers to the x0 value for the constriction
degree tract variables (LA, TTCD, TBCD,
VEL, or GLO).

Constriction Location is relevant only for
oral gestures, and refers to the x0 value for
the constriction location tract variables (LP,
TTCL, or TBCL).

Constriction Shape is relevant only for oral
gestures, and refers to the x0 value of
constriction shape tract variables. It is not
currently implemented.

Stiffness refers to the k value of the tract
variables.
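
One way to picture the descriptor structure in (3) is as a small record type. The sketch below is only an illustration under stated assumptions: the class name, field names, and validation rules are hypothetical, the stiffness numbers are placeholders, and the articulator-set table simply restates Figure 1.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (degree variable, location variable) for each implemented articulator set,
# following Figure 1. None marks sets with no constriction-location variable.
TRACT_VARIABLES = {
    "LIPS": ("LA", "LP"),
    "TT": ("TTCD", "TTCL"),
    "TB": ("TBCD", "TBCL"),
    "VEL": ("VEL", None),
    "GLO": ("GLO", None),
}


@dataclass(frozen=True)
class Gesture:
    """Descriptor structure in the spirit of (3); all names are hypothetical."""

    articulator_set: str
    constriction_degree: str                      # always present
    constriction_location: Optional[str] = None   # oral gestures only
    constriction_shape: Optional[str] = None      # not currently implemented
    stiffness: Optional[float] = None             # k; placeholder numeric value

    def __post_init__(self):
        if self.articulator_set not in TRACT_VARIABLES:
            raise ValueError(f"unknown articulator set: {self.articulator_set}")
        _, location_var = TRACT_VARIABLES[self.articulator_set]
        if location_var is None and self.constriction_location is not None:
            raise ValueError("constriction location applies only to oral gestures")

    def tract_variables(self) -> Tuple[str, ...]:
        """Names of the tract variables this gesture's descriptors point to."""
        degree_var, location_var = TRACT_VARIABLES[self.articulator_set]
        return (degree_var,) if location_var is None else (degree_var, location_var)

    def contrasts_with(self, other: "Gesture") -> bool:
        """Two gestures contrast if they differ in at least one descriptor."""
        return self != other


tongue_tip_closure = Gesture("TT", "closed", "alveolar", stiffness=1.0)
velic_opening = Gesture("VEL", "wide", stiffness=1.0)
print(tongue_tip_closure.tract_variables())
print(tongue_tip_closure.contrasts_with(velic_opening))
```

Keeping the articulator set separate from the parameter descriptors mirrors the distinction drawn in the text: a difference in articulator set is inherently categorical, whereas differences in degree, location, or stiffness are categorical only by virtue of the descriptor values chosen.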

Figure 2 displays the inventory of articulator 
sets and associated parameters that we posit are 
required for a general gestural phonology. The
parameters correspond to the particular tract 
variables of the model shown below them. Those 
parameters with asterisks are not currently 
implemented. 



Gestures:
Articulator   Dimensions

LIPS   (con degree, con location,            , stiffness)
        LA          LP

TT     (con degree, con location, con shape*, stiffness)
        TTCD        TTCL

TB     (con degree, con location, con shape*, stiffness)
        TBCD        TBCL

TR*    (con degree*, con location*,          , stiffness)

VEL    (con degree,              ,           , stiffness)
        VEL

GLO    (con degree, con location*,           , stiffness)



Figure 2. Inventory of articulator sets and associated 
parameters. 

For the present, we list without comment the 
possible descriptor values for the constriction 
degree (CD) and constriction location (CL)
dimensions in (4). In § 2.1, we will discuss the 
gestural dimensions and these descriptors in 
detail, including a comparison to current 
proposals of featural geometry. 






(4) CD descriptors: closed critical narrow mid
                    wide

    CL descriptors: protruded labial dental
                    alveolar post-alveolar
                    palatal velar uvular
                    pharyngeal
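
The descriptor structure in (3) and the value inventories in (4) can be pictured as a small record type. The following is only an illustrative sketch, not the authors' computational model: the class name `Gesture`, the helper `contrasts`, and the decision to leave out the unimplemented TR set and the constriction shape dimension are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Descriptor inventories copied from (4).
CD_VALUES = ("closed", "critical", "narrow", "mid", "wide")
CL_VALUES = ("protruded", "labial", "dental", "alveolar", "post-alveolar",
             "palatal", "velar", "uvular", "pharyngeal")

# Articulator sets; TR is omitted because it is marked as not implemented.
ORAL_SETS = ("LIPS", "TT", "TB")
ALL_SETS = ORAL_SETS + ("VEL", "GLO")


@dataclass(frozen=True)
class Gesture:
    """Hypothetical container for the descriptors of one gesture."""
    articulator_set: str
    constriction_degree: str                     # always present
    constriction_location: Optional[str] = None  # oral gestures only
    stiffness: Optional[float] = None            # k value, units unspecified

    def __post_init__(self):
        if self.articulator_set not in ALL_SETS:
            raise ValueError(f"unknown articulator set: {self.articulator_set}")
        if self.constriction_degree not in CD_VALUES:
            raise ValueError(f"unknown CD value: {self.constriction_degree}")
        if self.constriction_location is not None:
            if self.articulator_set not in ORAL_SETS:
                raise ValueError("constriction location applies to oral gestures only")
            if self.constriction_location not in CL_VALUES:
                raise ValueError(f"unknown CL value: {self.constriction_location}")


def contrasts(a: Gesture, b: Gesture) -> bool:
    """Contrasting gestures differ in at least one descriptor."""
    return a != b
```

Under these assumptions, `Gesture("LIPS", "closed", "labial")` and `Gesture("TT", "closed", "alveolar")` contrast in articulator set and constriction location, while a velic gesture such as `Gesture("VEL", "wide")` carries no location at all.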

In the phonology of dynamically defined
articulatory gestures that we are developing,
gestures are posited to be the atoms of
phonological structure. It is important to note that
such gestures are relatively abstract. That is, the
physically continuous movement trajectories are
analyzed as resulting from a set of discrete,
concurrently active gestural control regimes. They
are discrete in two senses: (1) the dynamic
parameters of a gesture's control regime remain
constant throughout the discrete interval of time
during which the gesture is active, and (2)
gestures in a language may differ from one
another in discrete ways, as represented by
different descriptor values. Thus, as argued in
Browman and Goldstein (1986) and Browman and
Goldstein (in press), the gestures for a given
utterance, together with their temporal
patterning, perform a dual function. They
characterize the actual observed articulator
movements (thus obviating the need for any
additional implementation rules), and they also
function as units of contrast (and more generally
capture aspects of phonological patterning). As
discussed in those papers, the gesture as a
phonological unit differs both from the feature and
from the segment (or root node in current feature
geometries). It is a larger unit than the feature,
being effectively a unitary constriction action,
parameterized jointly by a linked structure of
features (descriptor values). Yet it is a smaller
unit than the segment: several gestures linked
together are necessary to form a unit at the
segmental, or higher, levels.

1.3 Gestural scores: Articulatory tiers and
internal duration

In the preceding section, gestures were defined
with reference to a dynamical system that shapes
patterns of articulatory movements. Each gesture
possesses, therefore, not only an inherent spatial
aspect (i.e., a tract variable goal) but also an
intrinsic temporal aspect (i.e., a gestural
stiffness). Much of the power of the gestural
approach follows from these basic facts about
gestures (combined with their abstractness), since
they allow gestures to overlap in time as well as in
articulator and/or tract variable space (see also
Bell-Berti & Harris, 1981; Fowler, 1980, 1983;
Fujimura, 1981a,b). In this section, we show how
overlap among gestures is represented, and
demonstrate that simple changes in the patterns
of overlap between neighboring gestural units can
automatically produce a variety of superficially
different types of phonetic and phonological
variation.

Within the computational model described
above, the pattern of organization, or
constellation, of gestures corresponding to a given
utterance is embodied in a set of phasing
principles (see Kelso & Tuller, 1987; Nittrouer et
al., 1988) that specify the spatiotemporal
coordination of the gestures (Browman &
Goldstein, 1987). The pattern of intergestural
coordination that results from applying the
phasing principles, along with the interval of
active control for individual gestures, is displayed
in a two-dimensional gestural score, with
articulatory tiers on one dimension and temporal
information on the other (Browman et al., 1986;
Browman & Goldstein, 1987). A gestural score for
the word palm (pronounced [pam]) is displayed in
Figure 3a. As can be seen in the figure, the tiers in
a gestural score, on the vertical axis, represent the
sets of articulators (or the relevant subset thereof)
employed by the gestures, while the horizontal
dimension codes time.

The boxes in Figure 3a correspond to individual
gestures, labelled by their descriptor values for
constriction degree and constriction location
(where relevant). For example, the initial oral
gesture is a bilabial closure, represented as a
constriction of the LIPS, with a [closed]
constriction degree, and a [labial] constriction
location. The horizontal extent of each box
represents the interval of time during which that
particular gesture is active. During these
activation intervals, which are determined in the
computational model from the phasing principles
and the inherent stiffnesses of each gesture, the
particular set of dynamic parameter values that
defines each gesture is actively contributing to
shaping the movements of the model articulators.
In Figure 3b, curves are added that show the
time-varying tract variable trajectories generated
by the task dynamic model according to the
parameters indicated by the boxes. For example,
during the activation interval of the initial labial
closure, the curve representing LA (the vertical
distance between the lips) decreases. As can be
seen in these curves, the activation intervals
directly capture something about the durations of
the movements of the gestures.
[Figure 3 diagram: gestural score with tiers VEL, TB, LIPS, and GLO.
Boxes: VEL [wide]; TB [narrow pharyngeal]; LIPS [clo labial] for the
initial and final consonants; GLO [wide].]

Figure 3. Gestural score for palm [pam] using box notation. (a) Activation
intervals only; (b) model-generated tract variable motions added.
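
The way a tract variable such as LA moves toward its target during an activation interval can be sketched numerically. The example below is a rough illustration only: it assumes that each tract variable behaves as a critically damped second-order system with unit mass, so that the stiffness k alone sets the time course, and it uses the closed-form solution for a start from rest. The function name and all numerical values are hypothetical.

```python
import math


def tract_variable_trajectory(x_start, x0, k, times):
    """Tract variable value at each time (seconds since gesture onset).

    Assumes a critically damped mass-spring system with unit mass that
    starts at rest at x_start and is attracted to the target x0.
    """
    omega = math.sqrt(k)  # natural frequency implied by stiffness k
    return [
        x0 + (x_start - x0) * (1.0 + omega * t) * math.exp(-omega * t)
        for t in times
    ]


# Hypothetical numbers: lip aperture starts at 10 mm, the [closed]
# gesture has target 0 mm, and stiffness is 400 (in 1/s^2).
times = [i * 0.01 for i in range(26)]  # 0 to 0.25 s in 10 ms steps
la = tract_variable_trajectory(10.0, 0.0, 400.0, times)
```

With these values LA falls smoothly from its starting value toward the target throughout the activation interval, which is the qualitative behavior described for the initial labial closure. A stiffer gesture (larger k) would approach its target more quickly.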
Figure 4 presents an alternative symbolic
redisplay of the gestural score. Instead of the
'box' notation of Figure 3, a 'point' notation is
employed that references only the gestural
descriptors and relative timing of their 'targets.'
The extent of activation of the gestures in the box
notation is not indicated. In addition, association
lines between the gestures have been added.
These lines indicate which gestures are phased
with respect to each other. Thus, the display is a
shorthand representation of the phasing rules
discussed in Browman and Goldstein (1987).
The pair of gestures connected by a given
association line are coordinated so that a specified
phase of one gesture is synchronized with some
phase of the other. For example, the peak opening
of the GLO [wide] gesture (180 degrees) is
synchronized with the release phase (290 degrees)
of the LIPS [clo labial] gesture. Also important for
the phase rules is the projection of oral
constriction gestures onto separate Vowel and
Consonant tiers, which are not shown here
(Browman & Goldstein, 1987, 1988; see also
Keating, 1985).
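
To make the idea of phase synchronization concrete, the sketch below converts phase angles into times and solves for the onset of one gesture relative to another. It rests on an assumption that is not spelled out in this section: that phase advances linearly over the undamped natural period implied by a gesture's stiffness, with 0 degrees at gesture onset. The function names and the stiffness value are hypothetical.

```python
import math


def natural_period(k):
    """Undamped period (s) of a unit-mass oscillator with stiffness k."""
    return 2.0 * math.pi / math.sqrt(k)


def time_at_phase(onset, k, phase_deg):
    """Time at which a gesture starting at `onset` reaches `phase_deg`."""
    return onset + (phase_deg / 360.0) * natural_period(k)


def onset_for_sync(ref_onset, ref_k, ref_phase_deg, k, phase_deg):
    """Onset that makes `phase_deg` of a gesture coincide with
    `ref_phase_deg` of a reference gesture."""
    target_time = time_at_phase(ref_onset, ref_k, ref_phase_deg)
    return target_time - (phase_deg / 360.0) * natural_period(k)


# Hypothetical stiffness giving both gestures a 0.2 s natural period.
K = (2.0 * math.pi / 0.2) ** 2
lips_onset = 0.0
# GLO [wide] at 180 degrees is aligned with LIPS [clo labial] at 290 degrees.
glo_onset = onset_for_sync(lips_onset, K, 290.0, K, 180.0)
```

Under these assumptions the glottal gesture begins a little after the labial closure, so that its peak opening falls at the labial release.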

The use of point notation highlights the
association lines, and, as we shall see in § 2.2.2, is
useful for the purpose of comparing gestural
organizations with feature geometry
representations in which individual units lack any
extent in time. For the remainder of this section,
however, we will be concerned with showing the
extent of temporal overlap of gestures, and
therefore will employ the box notation form of the
gestural score.

The information represented in the gestural
score serves to identify a particular lexical entry.
The basic elements in such a lexical unit are the
gestures, which, as we have already seen, can
contrast with one another by means of differing
descriptor values. In addition, gestural scores for
different lexical items can contrast in terms of the
presence vs. absence of particular gestures
(Browman & Goldstein, 1986, in press; Goldstein
& Browman, 1986). Notice that one implication of
taking gestures as basic units is that the resulting
lexical representation is inherently
underspecified, that is, it contains no specifications for
irrelevant features. When a given articulator is
not involved in any specified gesture, it is
attracted to a 'neutral' position specific to that
articulator (Saltzman et al., 1988; Saltzman &
Munhall, 1989).

The fact that each gesture has an extent in time,
and therefore can overlap with other gestures, has
a variety of phonological and phonetic
consequences. Overlap between invariantly
specified gestures can automatically generate
contextual variation of superficially different
sorts: (1) acoustic noninvariance, such as the
different formant transitions that result when an
invariant consonant gesture overlaps different
vowel gestures (Liberman & Mattingly, 1985); (2)
allophonic variation, such as the nasalized vowel
that is produced by overlap between a
syllable-final velic opening gesture and the vowel gesture
(Krakow, 1989); and (3) various kinds of
'coarticulation,' such as the context-dependent
vocal tract shapes for reduced schwa vowels that
result from overlap by the neighboring full vowels
(Browman & Goldstein, 1989; Fowler, 1981). Here,
however, we will focus on the implications of
directly representing overlap among phonological
units for the phonological/phonetic alternations in
fluent speech.

Figure 4. Gestural score for palm [pam] using point notation, with association lines added.
Browman and Goldstein (1987) have proposed
that the wide variety of differences that have been
observed between the canonical pronunciation of a
word and its pronunciation in fluent contexts (e.g.,
Brown, 1977; Shockey, 1974) all result from two
simple kinds of changes to the gestural score: (1)
reduction in the magnitude of individual gestures
(in both time and space) and (2) increase in
overlap among gestures. That paper showed how
these two very general processes might account
for variations that have traditionally been
described as segment deletions, insertions,
assimilations and weakenings. The reason that
increased overlap, in particular, can account for
such different types of alternations is that the
articulatory and acoustic consequences of
increased overlap will vary depending on the
nature of the overlapping gestures. We will
illustrate this using some of the examples of
deletion and assimilation presented in Browman
and Goldstein (1987), and then compare the
gestural account with the treatment of such
examples in non-linear phonological theories that
do not directly represent overlap among
phonological units.

Let us examine what happens when a gestural
score is varied by increasing the overlap between
two oral constriction gestures, for example from
no overlap to complete synchrony. This 'sliding'
will produce different consequences in the
articulatory and acoustic output of the model,
depending on whether the gestures are on the
same or different articulatory tiers, i.e., whether
they employ the same or different tract variables.
If the gestures are on different articulatory tiers,
as in the case of a LIPS closure and a tongue tip
(TT) closure, then the resulting tract variable
motion for each gesture will be unaffected by the
other concurrent gesture. Their tract variable
goals will be met, regardless of the amount of
overlap. However, with sufficient overlap, one
gesture may completely obscure the other
acoustically, rendering it inaudible. We refer to
this as gestural 'hiding.' In contrast, when two
gestures are on the same articulatory tier, for
example, the tongue tip constriction gestures
associated with /t/ and /θ/, they cannot overlap
without perturbing each other's tract variable
motions. The two gestures are in competition:
they are attempting to do different tasks with the
identical articulatory structures. In this case, the
dynamical parameters for the two overlapping
gestural control regimes are 'blended' (Saltzman
et al., 1988; Saltzman & Munhall, in press).
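
The distinction between hiding and blending depends only on whether two temporally overlapping gestures share an articulatory tier. A minimal sketch of that decision follows. The record type, the function names, and the millisecond values are hypothetical, and the label "possible hiding" reflects the fact that actual hiding also depends on whether the overlap is sufficient to obscure one gesture acoustically.

```python
from typing import NamedTuple


class ActiveGesture(NamedTuple):
    tier: str      # articulatory tier, e.g. "LIPS", "TT", "TB"
    label: str     # descriptor label, e.g. "clo alveolar"
    onset: float   # start of activation interval (ms)
    offset: float  # end of activation interval (ms)


def overlap_duration(a: ActiveGesture, b: ActiveGesture) -> float:
    """Length of time during which both gestures are active."""
    return max(0.0, min(a.offset, b.offset) - max(a.onset, b.onset))


def interaction(a: ActiveGesture, b: ActiveGesture) -> str:
    """Classify the consequence of overlap between two gestures."""
    if overlap_duration(a, b) == 0.0:
        return "none"
    if a.tier == b.tier:
        return "blending"        # same tract variables: competition
    return "possible hiding"     # different tiers: both goals are met


tt_clo = ActiveGesture("TT", "clo alveolar", 0.0, 100.0)
lips_clo = ActiveGesture("LIPS", "clo labial", 60.0, 160.0)
tt_dental = ActiveGesture("TT", "crit dental", 60.0, 160.0)
late_lips = ActiveGesture("LIPS", "clo labial", 120.0, 220.0)
```

Here the alveolar closure paired with the labial closure is a candidate for hiding, the same closure paired with the dental gesture is a case of blending, and the late labial closure does not interact with it at all.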



Browman and Goldstein (1987) presented 
examples of articulations in fluent speech that 
showed the hiding and blending behavior
predicted by the model. Examples of hiding are 
transcribed in (5). 

(5) (a) /pɚfɛkt mɛməri/ → [pɚfɛkmɛməri]
    (b) /sɛvn̩ plʌs/ → [sɛvm̩ plʌs]

In (5a), careful listening to a speaker's production
of perfect memory produced in a fluent sentence
context failed to reveal any audible /t/ at the end
of perfect, although the /t/ was audible when the
two words were produced as separate phrases in a
word list. This deletion of the final /t/ is an
example of a general (variable) process in English
that deletes final /t/ or /d/ in clusters, particularly
before initial obstruents (Guy, 1980). However,
the articulatory data for the speaker examined in
Browman and Goldstein (1987) showed that
nothing was actually deleted from a gestural
viewpoint. The alveolar closure gesture at the end
of perfect was produced in the fluent context,
with much the same magnitude as when the two
words were produced in isolation, but it was
completely overlapped by the constrictions of the
preceding velar closure and the following labial
closure. Thus, the alveolar closure gesture was
acoustically hidden. This increase in overlap is
represented in Figure 5, which shows the (partial)
gestural scores posited for the two versions of this
utterance, based on the observed articulatory
movements. The gestures for the first syllable of
memory (shown as shaded boxes) are well
separated from the gestures for the last syllable of
perfect (shown as unshaded boxes) in the word list
version in Figure 5a, but they slide earlier in time
as shown in Figure 5b, producing substantial
overlap among three closure gestures. Note that
these three overlapping gestures are all on
separate tiers.

In (5b), seven plus shows an apparent
assimilation, rather than deletion, and was
observed when the phrase was produced at a fast
rate. Assimilation of final alveolar stops and
nasals to a following labial (or velar) stop is a
common connected speech process in English
(Brown, 1977; Gimson, 1962). Here again,
however, the articulatory data in Browman and
Goldstein (1987) showed that the actual change
was 'hiding' due to increased overlap: the alveolar
closure gesture at the end of seven was still
produced by the speaker in the 'assimilated'
version, but it was hidden by the preceding
labial fricative and following labial stop.
[Figure 5 diagram: oral tiers TB, TT, and LIPS for 'perfect memory' in
panels (a) and (b). Boxes include TB [clo velar], TT [clo alveolar], and
LIPS gestures, aligned with the transcribed segments.]

Figure 5. Partial gestural score for two versions of perfect memory, posited from observed articulator movements. Last
syllable of perfect shown in unshaded boxes; first syllable of memory shown in shaded boxes. Only oral tiers are
shown. (a) Words spoken in word list; (b) words spoken as part of fluent phrase.



The changes in the posited gestural score can be
seen in Figure 6. Because the velum lowering
gesture (VEL [wide]) at the end of seven in Figure
6b overlaps the labial closure, the hiding is
perceived as assimilation rather than
deletion. Evidence for such hidden gestures (in some
cases having reduced magnitude) has also been
provided by a number of
electropalatographic studies, where they have been taken as
evidence of 'partial assimilation' (Barry, 1985;
Hardcastle & Roach, 1979; Kohler, 1976 (for
German)). Thus, from a gestural point of view,
deletion (5a) and assimilation (5b) of final
alveolars may involve exactly the same
process: increase of overlap resulting in a hidden
gesture.

When overlap is increased between two gestures
on the same articulatory tier, rather than on
different articulatory tiers as in the above
examples, the increased overlap results in
blending between the dynamical parameters of
the two gestures rather than hiding. The
trajectories of the tract variables shared by the
two gestures are affected by the differing amounts
of overlap. Evidence for such blending can be seen
in examples like (6).

(6) /tɛn θimz/ → [tɛn̪θimz]
[Figure 6 diagram: tiers VEL, TB, TT, LIPS, and GLO for 'seven plus' in
panels (a) and (b). Boxes include a TT [clo alveolar] gesture and LIPS
[crit] and [clo] gestures, aligned with the transcribed segments.]

Figure 6. Partial gestural score for two versions of seven plus, posited from observed articulator movements. Last
syllable of seven shown in unshaded boxes; plus shown in shaded boxes. The starred [alveolar clo] gesture indicates
that laterality is not represented in these scores. (a) Spoken at slow rate; (b) spoken at fast rate.



The apparent assimilation in this case, as well as
in many other cases that involve conflicting
requirements for the same tract variables, has
been characterized by Catford (1977) as involving
an accommodation between the two units. This
kind of accommodation is exactly what is
predicted by parameter blending, assuming that
the same underlying mechanism of increased
gestural overlap occurs here as in the examples in
(5). The (partial) gestural scores hypothesized for
(6), showing the increase in overlap, are displayed
in Figure 7. The hypothesis of gestural overlap,
and consequent blending, makes a specific
prediction: the observed motion of the TT tract
variables resulting from overlap and blending
should differ from the motion exhibited by either
of the individual gestures alone. In particular, the
location of the constriction should not be identical
to that of either an alveolar or a dental, but rather
should fall somewhere in between. If this
prediction is confirmed, then three (superficially)
different fluent speech processes (deletion of final
alveolar stops in clusters, assimilation of final
alveolar stops and nasals to following labials and
velars, and assimilation of final alveolar stops to
other tongue tip consonants) can all be accounted
for as the consequence of increasing overlap
between gestures in fluent speech.
[Figure 7 diagram: oral tiers TB, TT, and LIPS for 'ten themes' in panels
(a) and (b). Boxes include TB [wide palatal] and TT [clo alveolar]
gestures, aligned with the transcribed segments.]

Figure 7. Hypothesized gestural score for two versions of ten themes. ten shown in unshaded boxes; themes shown in
shaded boxes. Only oral tiers are shown. (a) Spoken with pause between words; (b) spoken as part of fluent phrase.



How does the gestural analysis of these fluent
speech alternations compare with analyses
proposed by other theories of non-linear
phonology? Assimilations such as those in (6) have
been analyzed by Clements (1985) as resulting
from a rule that operates on sequences of alveolar
stops or nasals followed by [+coronal] consonants.
The rule, whose effect is shown in (7), delinks the
place node of the first segment and associates the
place node of the second segment to the first (by
spreading).

(7) [Autosegmental diagram. Manner tier: [-cont] [+cons];
    Supralaryngeal tier; Place tier: [+cor]. Association lines
    are not recoverable.]

Since the delinked features are assumed to be
deleted, by convention, and not realized
phonetically, the analysis in (7) predicts that the
place of articulation of the assimilated sequence
should be indistinguishable from that of the
second consonant when produced alone. This
claim differs from that made by the blending
analysis, which predicts that the assimilated
sequence should show the influence of both
consonants. These conflicting predictions can be
directly tested.
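
The two predictions can be set side by side in a toy calculation. This is only a sketch under stated assumptions: constriction location is coded on an arbitrary numerical scale (dental = 0.0, alveolar = 1.0), and blending is approximated as a weighted average of the two targets, which may not match the blending procedure actually used in the task dynamic model. All names are hypothetical.

```python
def blended_location(loc_first, loc_second, weight_first=0.5):
    """Blending prediction: a compromise between the two targets.

    The weighted average is an assumed stand-in for parameter blending.
    """
    if not 0.0 < weight_first < 1.0:
        raise ValueError("weight_first must lie strictly between 0 and 1")
    return weight_first * loc_first + (1.0 - weight_first) * loc_second


def delinked_location(loc_first, loc_second):
    """Delinking-and-spreading prediction: only the second place survives."""
    return loc_second


# Arbitrary illustrative scale for constriction location.
DENTAL, ALVEOLAR = 0.0, 1.0

blend = blended_location(ALVEOLAR, DENTAL)   # /n/ followed by dental fricative
spread = delinked_location(ALVEOLAR, DENTAL)
```

On this scale the blending analysis yields a location strictly between the alveolar and dental values for any weight, whereas the delinking analysis yields exactly the dental value. Articulatory measurements of the constriction in ten themes and in themes alone could in principle separate the two.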

The overlap analysis accounts for a wider range
of phenomena than just the assimilations in (6).
The deletion of final alveolars (in clusters) and the
assimilation of final alveolars to following stops in
(5) were also shown to be cases of hiding due to
increased gestural overlap. Clements' analysis
does not handle these additional cases, and cannot
be extended to do so without major
reinterpretation of autosegmental formalisms. To see
that this is the case, suppose Clements' analysis is
extended by eliminating the [+coronal]
requirement on the second segment (for fluent
speech). This would produce an assimilation for a
case like (5b), but it would not be consistent with
the data showing that the alveolar closure gesture
is, in fact, still produced. To interpret a delinked
gesture as one that is articulatorily produced, but
auditorily hidden, would require a major change
in the assumptions of the framework.

Within the framework of Sagey (1986), the cases
of hiding in (5) could be handled, but the analysis
would fail to capture the parallelism between
these cases and the blending case in (6). In
Sagey's framework, there are separate class nodes
for each of the independent oral articulators,
corresponding to the articulatory tiers. Thus, an
assimilation like that in (5b) could be handled as
in (8): the labial node of the second segment is
spread to the preceding segment's place node,
effectively creating a 'complex' segment involving
two articulations.



(8) [Tree diagram. Node labels: ROOT; Supralaryngeal, Supralaryngeal;
    Place, Place; Coronal; Labial. Association lines are not
    recoverable.]



However, the example in (6) could not be handled
in this way. In Sagey's framework, complex
segments can only be created in case there are
different articulator nodes, whereas in (6), the
same articulator is involved. Thus, an analysis in
Sagey's framework will not treat the examples in
(5) and (6) as resulting from a single underlying
process. If the specific prediction made by the
overlap and blending analysis for cases like (6)
proves correct, this would be evidence that a
unitary process (overlap) is indeed involved, a
unity that is directly captured in the overlapping
gesture approach, but not in Sagey's framework.

Finally we note that, in general, the gestural
approach gives a more constrained and
explanatory account of the casual speech changes.
All changes are hypothesized to result from two
simple mechanisms, which are intrinsically
related to the talker's goals of speed and fluency:
reduce the size of individual gestures, and
increase their overlap. The detailed changes that
emerge from these processes are epiphenomenal
consequences of the 'blind' application of these
principles, and they are explicitly generated as
such by our model. Moreover, the gestural
approach makes predictions about the kinds of
fluent speech variation expected in other
languages. Given differences between languages
in the canonical gestural scores for lexical items
(due to language-specific phasing principles), the
same casual speech processes are predicted to
have different consequences in different
languages. For example, in a language such as
Georgian in which stops in the canonical form are
always released, word-finally and in clusters
(Anderson, 1974), the canonical gestural score
should show less overlap between neighboring
stops than is the case in English. The gestural
model predicts that overlap between gestures will
increase in casual speech. But in a language such
as Georgian, an increase of the same magnitude
as in English would not be sufficient to cause
hiding. Thus, no casual speech assimilations and
deletions would be predicted for such a language,
at least when the increase of gestural overlap is
the same magnitude as for English.
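
The arithmetic behind this cross-language prediction can be shown with a toy example. Everything in it is hypothetical: the durations, the lags between gesture onsets, the size of the casual speech shift, and the simplifying criterion that a gesture counts as hidden only when the following gesture is active throughout its whole activation interval. The labels 'English-like' and 'Georgian-like' stand for scores with more and less canonical overlap, not for measured values.

```python
def overlap_ms(first_onset, first_offset, second_onset):
    """Part of the first gesture's interval (ms) during which the second
    gesture is also active. The second is assumed to outlast the first."""
    return max(0, first_offset - max(first_onset, second_onset))


def is_hidden(first_onset, first_offset, second_onset):
    """Toy criterion: hidden when overlapped for its entire duration."""
    duration = first_offset - first_onset
    return overlap_ms(first_onset, first_offset, second_onset) >= duration


FIRST = (0, 100)        # activation interval of the first stop (ms)
CASUAL_SHIFT = 40       # identical increase in overlap for both scores

ENGLISH_LIKE_LAG = 40   # canonical score already has 60 ms of overlap
GEORGIAN_LIKE_LAG = 100 # released stops: no canonical overlap

english_casual = is_hidden(*FIRST, ENGLISH_LIKE_LAG - CASUAL_SHIFT)
georgian_casual = is_hidden(*FIRST, GEORGIAN_LIKE_LAG - CASUAL_SHIFT)
```

With these numbers the same 40 ms shift hides the first stop in the score that starts with substantial overlap, but leaves 60 ms of it unobscured in the score that starts with none.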

The usefulness of internal duration and overlap
among phonological elements has begun to be
recognized by phonologists (Hammond, 1988;
Sagey, 1988). For example, Sagey (1986, pp. 20-21)
has argued that phonological association lines,
which serve to link the features on separate tiers,
'represent the relation of overlap in time... Thus...
the elements that they link...must have internal
duration.' However, in nongestural phonologies,
the consequences of such overlap cannot be
evaluated without additional principles specifying
how overlapping units interact. In contrast, in a
gestural phonology, the nature of the interaction
is implicit in the definition of the gestures
themselves, as dynamical control regimes for a
(model) physical system. Thus, one of the virtues
of the gestural approach is that the consequences
of overlap are tightly constrained and made
explicit in terms of a physical model. Explicit
predictions (e.g., about the relation between the
constrictions formed by the tongue tip in themes
and in ten themes, as well as about language
differences) can be made that test hypotheses
about phonological structures.
In summary, a gestural phonology characterizes
the movements of the vocal tract articulators
during speech in terms of a minimal set of discrete
gestural units and patterns of spatiotemporal
organization among those units, using dynamic
and articulatory models that make explicit the
articulatory and acoustic consequences of a
particular gestural organization (i.e., gestural
score). We have argued that a number of
phonological properties of utterances are inherent
in these explicit gestural constellations, and thus
do not require postulation of additional
phonological structure. Distinctiveness (Browman
& Goldstein, 1986, in press) and syllable structure
(Browman & Goldstein, 1988) can both be seen in
gestural scores. In addition, a number of
postlexical phonological alternations (Browman &
Goldstein, 1987) can be better described in terms
of gestures and changes in their organization
(overlap) than in terms of other kinds of
representations.

2 RELATION BETWEEN GESTURAL 
STRUCTURES AND FEATURE 
GEOMETRY 

In the remainder of this paper, we look more
closely at the relation between gestural structures
and recent proposals of feature geometry. The
comparison shows that there is much overall
similarity and compatibility between feature
geometry and the geometry of phonological
gestures (§ 2.1). Nevertheless, there are some
differences. Most importantly, we show that the
gesture is a cohesive unit, that is, a coordinated
action of a set of articulators, moving to achieve a
constriction that is specified with respect to its
degree as well as its location. We argue that the
gesture, and gestural scores, could usefully be
incorporated into feature geometry (§ 2.2). The
gestural treatment of constriction degree as part
of a gestural unit leads, however, to an apparent
disparity with how manner features are currently
handled in feature geometry. We conclude,
therefore (§ 3), by proposing a hierarchical tube
geometry that resolves this apparent disparity,
and that also, we argue, clarifies the nature of
manner features and how they should be treated
within feature geometry, or any phonological
approach.

2.1 Articulatory geometry and gestural 
descriptors 

In this section, the details of the gestural
descriptor structures (the distinctive categories of
the tract variable parameters) are laid out. Many
aspects of these structures are rooted in long
tradition. Jespersen (1914), for example,
suggested an 'analphabetic' system of specifying
articulatory place along the upper tract, the
articulatory organ involved, and the degree of
constriction. Pike (1943), in a more elaborated
analphabetic system, included variables for
impressionistic characterization of the
articulatory movements, including 'crests,'
'troughs,' and 'glides' (movements between crests
and troughs). More recently, various authors (e.g.,
Campbell, 1974; Halle, 1982; Sagey, 1986;
Vennemann & Ladefoged, 1973) have argued that
phonological patterns are often formed on the
basis of the moving articulator used (the
articulator set, in the current system). The
gestural structures described in this paper differ
from these accounts primarily in the explicitness
of the functional organization of the articulators
into articulatory gestures, and in the use of a
dynamic model to characterize the coordinated
movements among the articulators.

The articulatory explicitness of the gestural
approach leads to a clear-cut distinction between
features of input and features of output. That is, a
feature such as 'sonority' has very little to do with
articulation, and a great deal to do with acoustics
(Ladefoged, 1988a,b). This difference can be
captured by contrasting the input to the speech
production mechanism (the individual gestures in
the lexical entry) and the output (the
articulatory, aerodynamic and acoustic
consequences of combining several gestures in
different parts of the vocal tract). The gestural
descriptors, then, characterize the input
mechanism: they are the 'features' of a purely
articulatory phonology. Most traditional feature
systems, however, represent a conflation of
articulatory and acoustic properties. In order to
avoid confusion with such combined feature
systems, and in order to emphasize that gestural
descriptors are purely articulatory, we retain the
non-standard terminology developed in the
computational model for the category names of the
gestural descriptor values.

In § 2.1.1, we discuss the descriptors
corresponding to the articulator sets and show
how they are embedded in a hierarchical
articulatory geometry, comparable to recent
proposals of feature geometry. In §§ 2.1.2-5, we
examine the other descriptors in turn,
demonstrating that they can be used to define
natural classes, and showing how the differences
between these descriptors and other feature
systems stem from their strictly articulatory
and/or dynamic status.

2.1.1 Anatomical hierarchy and articulator 
sets 

The gestural descriptors listed in Figure 2 are
not hierarchically organized. That is, each
descriptor occupies a separate dimension
describing the movement of a gesture. However,
there is an implicit hierarchy for the sets of
articulators involved. This can be seen in Figure 8,
which redisplays (most of) the inventory of
articulator sets and associated parameter
dimensions from Figure 2, and adds a grouping of
gestures into a hierarchy based on articulatory
independence. Because the gestures characterize
movements within the vocal tract, they are
effectively organized by the anatomy of the vocal
tract. TT and TB gestures share the individual
articulators of tongue body and jaw, so they define
a class of Tongue gestures. Similarly, both LIPS
and Tongue gestures use the jaw, and are
combined in the class of Oral gestures. Finally,
Oral, velic (VEL), and glottal (GLO) constitute
relatively independent subsystems that combine
to form a description of the overall Vocal Tract.
This anatomical hierarchy constitutes, in effect,
an articulatory geometry.
Figure 8. Articulatory geometry tree.
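
The grouping described in the text can be written down as a small tree, with a helper that finds the lowest class shared by two articulator sets. This is a sketch built from the prose description only: the figure itself is not reproduced here, and the dictionary layout and function names are hypothetical.

```python
# Child -> parent links, following the grouping described in the text.
PARENT = {
    "TT": "Tongue",
    "TB": "Tongue",
    "LIPS": "Oral",
    "Tongue": "Oral",
    "Oral": "Vocal Tract",
    "VEL": "Vocal Tract",
    "GLO": "Vocal Tract",
}


def ancestors(node):
    """The node itself followed by every class that dominates it."""
    if node != "Vocal Tract" and node not in PARENT:
        raise ValueError(f"unknown node: {node}")
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def lowest_common_class(a, b):
    """Smallest class in the hierarchy that contains both a and b."""
    above_a = ancestors(a)
    for node in ancestors(b):
        if node in above_a:
            return node
    raise ValueError("nodes do not share a class")
```

For instance, TT and TB meet at Tongue, LIPS and TT meet at Oral, and VEL meets any oral articulator set only at Vocal Tract, mirroring the degrees of articulatory independence described above.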



The importance of an articulatory geometry has 
repeatedly been noted, for example in the use of 
articulatory subsystems by Abercrombie (1967) 
and Anderson (1974). More recent!v, it has formed 
the basis of a number of propos^als of feature 



geometry (Clements, 1985; Ladefoged, 1988a,b; 
Ladefoged & Halle, 1988; McCarthy, 1988; Sagey, 
1986). The hierarchy of the moving articulators is 
central to all these proposals of featural 
organization, although the evidence discussed
may be phonological patterns rather than
physiological structure. Leaving aside manner
features (to be discussed in §§ 2.2.1 and 3), the
various hierarchies differ primarily in the
inclusion of a Supralaryngeal node in the feature
geometries of Clements (1985) and Sagey (1986), 
and a Tongue node in the articulatory geometry 
diagrammed in Figure 8. 
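
As an illustration only, the hierarchy described for Figure 8 can be sketched as a small tree. The nested-dictionary encoding and the helper names (`path_to`, `lowest_common_node`) are assumptions made for this sketch, not part of the model; the node labels are those used in the text. The lowest node shared by two articulator sets gives a rough measure of how closely related they are anatomically.

```python
# Articulatory geometry of Figure 8 as a nested dict (illustrative encoding).
# Leaves are articulator sets; inner nodes are the classes named in the text.
ARTICULATORY_GEOMETRY = {
    "Vocal Tract": {
        "Oral": {
            "LIPS": {},
            "Tongue": {"TT": {}, "TB": {}},
        },
        "VEL": {},
        "GLO": {},
    }
}


def path_to(tree, target, prefix=()):
    """Return the tuple of node names from the root down to `target`, or None."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == target:
            return here
        found = path_to(children, target, here)
        if found is not None:
            return found
    return None


def lowest_common_node(tree, a, b):
    """Name of the deepest node dominating both articulator sets `a` and `b`."""
    path_a, path_b = path_to(tree, a), path_to(tree, b)
    common = [x for x, y in zip(path_a, path_b) if x == y]
    return common[-1]


print(lowest_common_node(ARTICULATORY_GEOMETRY, "TT", "TB"))    # Tongue
print(lowest_common_node(ARTICULATORY_GEOMETRY, "LIPS", "TB"))  # Oral
print(lowest_common_node(ARTICULATORY_GEOMETRY, "TB", "GLO"))   # Vocal Tract
```

Note that no Supralaryngeal node appears in this tree, in keeping with the input hierarchy of Figure 8.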

There is no Supralaryngeal node included in the 
articulatory geometry of Figure 8, because this 
geometry effectively organizes the vocal tract 
input. As we will argue in § 3, the Supralaryngeal 
node is important in characterizing the output of
the vocal tract. However, using the criterion of 
articulatory independence, we see no anatomical 
reason to combine any of the three major 
subsystems into a higher node in the input 
hierarchy. Rather than being part of a universal 
geometry, further combinations of the laryngeal,
velic, and Oral subsystems into class nodes such
as Supralaryngeal, or Central (Oral) vs.
Peripheral (velic and laryngeal), should be
invoked where necessitated by language-particular
organization. For example, the central-peripheral
distinction was argued to exist for Toba
Batak (Hayes, 1986). 

The Tongue node in Figure 8 is proposed on 
anatomical grounds, again using the criterion of 
articulatory independence (and relatedness). That
is, the TT articulator set shares two articulators 
with the TB articulator set— the tongue body and 
the jaw. Put another way, the tongue is an 
integral structure that is clearly separate from the
lips. There is some evidence for this node in 
phonological patterns as well. A number of 
articulations made with the tongue cannot be
clearly categorized as being made with either TT
or TB. For example, laterals in Kuman (Papua 
New Guinea) alternate between velars and 
coronals (Lynch, 1983, cited in McCarthy, 1988). 
They are always Tongue articulations, but are not
categorizable as exclusively TT or TB. Rather, 
they require some reference to both articulator 
sets. This is also the case for English laterals, 
which can alternate between syllable-initial
coronals and syllable-final velars (Ladefoged,
1982). 

Palatal and palatoalveolar consonants are 
another type of articulation that falls between TT 
and TB articulations (Keating, 1988; Ladefoged &
Maddieson, 1986; Recasens, in preparation).
Keating (1988) suggests that the intermediate 
status of palatals— partly TT and partly TB— can 
be handled by treating palatal consonants as 
complex segments, represented under both the 
tongue body and tongue tip/blade nodes. However, 
without the higher level Tongue node, such a 
representation equates complex segments such as 
labio-velars, consisting of articulations of two 
separate articulators (lips and tongue), with 
palatals, arguably a single articulation of a single
predorsal region of the tongue (Recasens, in 
preparation). With the inclusion of the Tongue 
node, labio-velars and palatals would be similar in 
being double Oral articulations, but different in 
being LIPS plus Tongue articulations vs. two 
Tongue articulations. Thus, the closeness of the 
articulations is reflected in the level of the nodes. 

This evidence is suggestive, but not conclusive,
on the role of the Tongue node in phonological 
patterns. We predict that more evidence of 
phonological patterns based on the anatomical 
interdependence of the parts of the tongue should 
exist. One type of evidence should result from the 
fact that one portion of the tongue cannot move
completely independently of the other portions. 
This lack of independence can be seen, for 
example, in the suggestion that the tongue body 
may have a characteristically more backed shape 
for consonants using the tip/blade in dentals as 
opposed to alveolars (Ladefoged & Maddieson,
1986; Stevens et al., 1986). We would expect to 
find blocking rules based on such considerations. 

2.1.2 Constriction degree (CD) 

CD is the analog within the gestural approach of 
the manner feature(s). However, it is crucial to
note that, unlike the manner classes in feature
geometry, CD is first and foremost an attribute of
the gesture (the constriction made by the moving
set of articulators) and therefore, at the gestural
level, is solely an articulatory characterization.
(This point will be elaborated in § 2.2.1.) In our
model, CD is a continuum divided into the
following discrete ranges: [closed], [critical],
[narrow], [mid], and [wide] (Ladefoged, 1988b,
refers to such a partitioning of a continuum as an
'ordered set' of values). The two most closed
categories correspond approximately to acoustic
stops and fricatives; the names used for these 
categories indicate that these values are 
articulatory rather than acoustic. Thus, the 
second degree of constriction, labelled [critical], 
indicates that critical degree of constriction for a 
gesture at which some particular aerodynamic 
consequences could obtain if there were 



appropriate air flow and muscular tension. That 
is, the critical constriction value permits frication 
(turbulence) or voicing, depending on the set of 
articulators involved (oral or laryngeal). Similarly, 
[closed] refers to a tight articulatory closure for
that particular gesture; the overall state of the
vocal tract might cause this closure to be,
acoustically, either a stop or a sonorant. This, in
turn, will be determined by the combined effects of
the concurrently active gestures. § 3 will discuss
in greater detail how we account for natural
classes that depend on the consequences of 
combining gestures. 

The categorical distinctions among [closed], 
[critical], and the wider values (as a group) are 
clearly based on quantal articulatory-acoustic
relations (Stevens, 1972). The basis for the
distinction among wider values is not as easy to
find in articulatory-acoustic relations, although
[narrow] might be identified with Catford's (1977)
[approximant] category. Catford is also careful to
define 'articulatory stricture' in solely articulatory 
terms. He defines [approximant] as a constriction 
just wide enough to yield turbulent flow under 
high airflow conditions such as in open glottis for
voicelessness, but laminar flow under low airflow 
conditions such as in voicing. The other 
descriptors are required to distinguish among 
vowels. For example, contrasts among front 
vowels differing in height are represented as
[palatal] constriction locations (cf. Wood, 1982)
with [narrow], [mid], or [wide] CDs, where these
categories might be established on the basis of
sufficient perceptual and articulatory contrast in
the vowel system (Lindblom, 1986; Lindblom et
al., 1983). If additional differentiation is required,
values are combined to indicate intermediate
values (e.g., [narrow mid]). In addition, [wide] vs.
[narrow] can be used to distinguish the size of 
glottal aperture— the CD for GLO — associated 
with aspirated and unaspirated stops,
respectively. 
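
A minimal sketch of CD as an ordered set of categories follows. The use of Python's `IntEnum`, the integer codes, and the helper `is_wider_class` are assumptions made for illustration; only the category names and their ordering come from the text, and the glottal settings simply restate the aspirated/unaspirated remark above.

```python
from enum import IntEnum


class CD(IntEnum):
    """Constriction degree categories, ordered from most closed to most open."""
    CLOSED = 0
    CRITICAL = 1
    NARROW = 2
    MID = 3
    WIDE = 4


def is_wider_class(cd):
    """True for the values the text groups together as 'the wider values'."""
    return cd >= CD.NARROW


# Hypothetical glottal settings following the aspirated/unaspirated remark.
GLO_CD = {"aspirated stop": CD.WIDE, "unaspirated stop": CD.NARROW}

print(CD.CLOSED < CD.CRITICAL < CD.NARROW < CD.MID < CD.WIDE)  # True
print([cd.name for cd in CD if is_wider_class(cd)])
```

The integer codes carry only order, not distance; nothing in the sketch claims that the categories are equally spaced along the continuum.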

2.1.3 Constriction location (CL)

Unlike CD, which differs from its featural 
analog of manner by being articulatory rather 
than acoustic, both CL and its featural analog of 
place are articulatory in definition. However, CL 
differs from place features in not being 
hierarchically related to the articulator set. That
is, the set of articulators moving to make a 
constriction, and the location of that constriction, 
are two independent (albeit highly related) 
dimensions of a gesture. Thus, we use the label 
'Oral' rather than 'place' in the articulatory 
hierarchy, not only to emphasize the anatomical 
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nature of the hierarchy, but also to avoid the 
conflation of moving articulator and location along 
the tract that 'place' conveys. 

The constriction location refers to the location 
on the upper or back wall of the vocal tract where 
a gestural constriction occurs, and thus is 
separate from, but constrained by, the articulator 
set moving to make the constriction. Figure 9
expands the articulator sets, the CL descriptor
values, and the possible relations between them:
the locations where a given set of articulators can 
form a constriction. Notice that each articulator 
set maps onto a subset of the possible constriction 
locations (rather than all possible CLs), where the 
subset is determined by anatomical possibility. 
Notice also that there is a non-unique mapping 
between articulator sets and CL values. For 
example, both TT and TB can form a constriction 
at the hard palate; indeed, it may be possible for 
TB to form a constriction even further forward in 
the mouth. Thus constriction location cannot be 
subsumed under the moving articulator hierarchy, 
contrary to the proposals by Ladefoged (1988a) 
and Sagey (1986), among others. 



Figure 9. Possible mappings between articulator sets and
constriction locations.

This can be seen most clearly in the case of the 
[labial] and [dental] locations, where either the 
LIPS or TT articulator sets can form constrictions, 
as shown in (9) (the unusual linguo-labials are
discussed in Maddieson, 1987). (A similar matrix is
presented in Ladefoged & Maddieson, 1986, but 
not as part of a formal representation.) That is, 
the [labial] constriction location cannot be 
associated exclusively with the LIPS articulator 
set. Similarly, the [dental] constriction location 
cannot be associated exclusively with the TT 
articulator set. Thus, we view CL as an 
independent cross-classifying descriptor 
dimension whose values cannot be hierarchically 
subsumed uniquely under particular articulator 
sets. 



(9)
                    Articulator set
    CL        LIPS            TT
    labial    bilabial        linguo-labial
    dental    labio-dental    dental
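
The cross-classification in (9) can be sketched as a lookup table. This is a partial, illustrative encoding: it lists only the pairings mentioned in the text (treating a constriction at the hard palate as [palatal]), and the names `POSSIBLE_CL`, `TRADITIONAL_NAME`, and `articulators_for` are assumptions of the sketch rather than part of the model.

```python
# Partial articulator-set -> constriction-location table; only pairings
# mentioned in the text are listed, so this is not the full Figure 9.
POSSIBLE_CL = {
    "LIPS": {"labial", "dental"},
    "TT": {"labial", "dental", "palatal"},
    "TB": {"palatal"},
}

# Traditional names for the pairings in (9).
TRADITIONAL_NAME = {
    ("LIPS", "labial"): "bilabial",
    ("TT", "labial"): "linguo-labial",
    ("LIPS", "dental"): "labio-dental",
    ("TT", "dental"): "dental",
}


def articulators_for(location):
    """All articulator sets that can form a constriction at `location`."""
    return sorted(a for a, locs in POSSIBLE_CL.items() if location in locs)


for cl in ("labial", "dental", "palatal"):
    print(cl, articulators_for(cl))
```

Because several locations return more than one articulator set, a CL value alone does not identify the moving articulators, which is the sense in which CL is treated as an independent dimension.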



We use multivalued (rather than binary-valued)
CL descriptors, following Ladefoged (1988b), since 
the CL descriptor values correspond to categorical 
ranges of the continuous dynamic parameter. In 
the front of the mouth, the basis for the discrete 
ranges of CL presumably involves the actual
differentiated anatomical landmarks, so that 
parameter values are tuned with respect to them. 
There may, in addition, be relatively invariant 
auditory properties associated with the different
locations (Stevens, 1989). For the categories that
are further back (palatal and beyond), Wood
(1982) hypothesizes that the distinct CLs emerge
from the alignment of Stevens' quantal
considerations with the positioning possibilities
allowed by the tongue musculature. To the extent
that this set of descriptor values is too limited, it
can, again, be extended by combining descriptors,
e.g., [palatal velar], or by using 'retracted' and
'advanced' (Principles of the IPA, 1949).

2.1.4 Constriction shape (CS)

In some cases, gestures involving the same 
articulator sets and the same values of CL and CD 
may differ in the shape of the constriction, as
looked at in the frontal, rather than sagittal,
plane. For example, constrictions involving TT
gestures may differ as to whether they are formed
with the actual tip or blade of the tongue, the
shape of the constriction being 'wider' in the third
dimension if produced with the blade. The
importance of this difference has been built into
recent feature systems as apical vs. laminal
(Ladefoged, 1988a), or [distributed] (Sagey, 1986). 
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Some method for controlling such differences 
needs to be built into our computational model. An 
additional TT tract variable (TTR) that specifies 
the orientation (angle) of the tongue tip in the 
sagittal plane with respect to the CL and CD axes 
is currently being incorporated into the task 
dynamic model. It is possible that different
settings of this variable will be able to produce the
required apical/laminal differences, as well as
allowing the sublaminal contact involved in
extreme retroflex stops (Ladefoged & Maddieson,
1986). 

For TB gestures, an additional tract variable is 
also required to control cross-sectional shape. One 
of the relevant shape differences involves the 
production of laterals, in which at least one of the 
sides of the tongue does not make firm contact 
with the molars. Ladefoged (1980) suggests that 
the articulatory maneuver involved is a narrowing
of the tongue volume, so that it is pulled away 
from the sides of the mouth. Such narrowing is an 
attractive option for an additional TB shaping 
tract variable. In this proposal, an alveolar lateral
would essentially involve two gestures: a TT
closure, and a TB gesture with a [narrowed] value
for CS and perhaps defaults for CL and CD. (Some
further specification of TB CD would clearly be
required for lateral fricatives, as opposed to
lateral approximants.) In this sense, laterals
would be complex constellations of gestures, as
suggested by Ladefoged and Maddieson (1986),
and similar to the proposal of Keating (1988) for
treating palatals as complex segments. Another 
role for a TB shaping tract variable might involve 
bunching for rhotics (Ladefoged, 1980). Finally, it 
is unclear whether an additional shape parameter 
is required for LIPS gestures. It might be required
to describe differences between the two kinds of 
rounding observed in Swedish (e.g., Lindau, 1978; 
Linker, 1982). On the other hand, given proper 
constraints on lip shape (Abry & Boë, 1986), it
may be possible to produce all the required lip 
shapes with just LA and LP. 

2.1.5 Dynamic descriptors

The stiffness (k) of a gesture is a dynamical
parameter, inferred from articulatory motions, 
that has been shown to vary as a function of 
gestural CD, stress and speaking rate (Browman 
& Goldstein, 1985; Kelso et al., 1985; Ostry & 
Munhall, 1985). In addition, however, we 
hypothesize that stiffness may be tuned 
independently, so that it can serve as the primary 
distinction between two gestures. /j/ and /w/ are
two cases in which gestural stiffness as an 



additional independent parameter may be 
specified, /j/ is a TB [narrow palatal] gesture, and 
/w/ is a complex formed by a TB [narrow velar]
gesture and a LIPS [narrow protruded] gesture.
Our current hypothesis is that these gestures
have the same CD and CL as for the
corresponding vowels (/i/ and /u/), but that they 
differ in having an [increased] value of stiffness. 
This is similar to the articulatory description of 
glides in Catford (1977). Finally, it is possible that 
the stiffness value that governs the rate of 
movement into a constriction is related to the 
actual biomechanical stiffness of the tissues 
involved. If so, it would be relevant to those 
gestures that, in fact, require a specific muscular 
stiffness: trills and taps likely involve
characteristic values of oral gesture stiffness, and 
pitch control and certain phonation types require 
specified vocal fold stiffnesses (Halle & Stevens, 
1971; Ladefoged, 1988^). 
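
To make the role of stiffness concrete, the following sketch assumes that a gesture behaves like a critically damped mass-spring system released at rest, with unit mass. That dynamical form, the stiffness values, and the function names are assumptions introduced here for illustration and are not specified in this section. Under those assumptions, raising the stiffness while keeping the same target (same CD and CL) shortens the time needed to approach the target, roughly in proportion to the square root of the stiffness ratio.

```python
import math


def position(t, k, start=1.0, target=0.0, mass=1.0):
    """Closed-form position of a critically damped mass-spring released at rest."""
    omega = math.sqrt(k / mass)  # natural frequency set by stiffness
    return target + (start - target) * (1.0 + omega * t) * math.exp(-omega * t)


def time_to_reach(fraction, k, step=0.001):
    """Smallest time (to within `step`) at which the remaining distance <= fraction."""
    t = 0.0
    while position(t, k) > fraction:
        t += step
    return t


vowel_like = time_to_reach(0.1, k=100.0)   # hypothetical stiffness values
glide_like = time_to_reach(0.1, k=400.0)   # four times stiffer, same target
print(round(vowel_like / glide_like, 2))   # close to 2, i.e. sqrt(400 / 100)
```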

2.2 The representation of phonological 
units 

In § 2.1, we laid out the details of gestural
descriptors and how they are organized by an 
articulatory geometry. Setting aside for the 
moment the critical aspects of gestures discussed 
in § 1.3 (internal duration and overlap), the 
differences between feature geometry and the 
organization of gestural descriptors have so far 
been comparatively minor — primarily the 
proposed Tongue node and the non-hierarchical 
relation between constriction location and the 
moving articulators. In this section, we consider 
two ways in which a gestural analysis suggests a 
different organization of phonological structure 
from that proposed by the feature geometries 
referred to in § 2.1.1. First, we present evidence 
that gestures function as cohesive units of 
phonological structure (§2.2.1). Second, we 
discuss the advantages of the gestural score as a 
phonological notation (§ 2.2.2). 

2.2.1 The gesture as a unit in phonological
patterns 

In Figure 8, gestures are in effect the terminal 
nodes of a feature tree. Notice that the 
constriction degree, as well as constriction 
location, constriction shape and stiffness, is part of 
the descriptor bundle. That is, constriction degree 
is specified directly at the articulator node—it is 
one dimension of gestural movement. Looked at in 
terms of the gestural score (e.g., Figure 3),
successive units on a given oral tier contain 
specifications for both constriction location and 
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degree of that articulator set. This positioning of 
CD differs from that proposed in current feature
geometries. While CL and CS (or their analogs) 
are typically considered to be dependents of the 
articulator nodes, CD (or its analogs) is not. 
Rather, some of the closest analogs to CD — 
[stricture], [continuant], and [sonorant]— are 
usually associated with higher levels, either the 
Supralaryngeal node (Clements, 1985) or the Root 
node (Ladefoged, 1988a,b; Ladefoged & Halle, 
1988; McCarthy, 1988; Sagey, 1986). Is there any 
evidence, then, that the unit of the gesture plays a 
role in phonological organization? Within the 
approach of feature geometry, such evidence 
would consist, for example, of rules in which an 
articulator set and the degree and location of the 
constriction it forms either spread or delete 
together, as a unitary whole. 

It is in fact generally assumed, implicitly 
although not explicitly, that velic and glottal 
features have this type of unitary gestural 
organization. That is, features such as [+nasal] 
and [-voice] combine the constriction degree (wide) 
along with the articulator set (velum or glottis). 
Assuming a default specification for these features 
([-nasal] and [+voice]), then denasalization
consists of the deletion of a nasal gesture, and
intervocalic voicing of the deletion of a laryngeal 
gesture. Additional implicit use of a gestural unit 
can be found in Sagey's (1986) proposal that 
[round] be subordinate to the Labial node. This is
exactly the organization that a gestural analysis 
suggests, since [round] is effectively a specification 
of the degree and nature of the constriction of the
lips. 

It is in the case of primary oral gestures that 
proposals positioning CD at the gestural level and 
those positioning it at higher levels contrast most 
sharply. The inherent connection among all the 
component aspects of making a constriction, or in 
other words, the unitary nature of the gesture, can 
be seen most clearly when oral gestures are
deleted, a phenomenon sometimes described as 
delinking of the Place (or Supralaryngeal) node— 
debuccalization. In a gestural analysis, when the 
movement of an articulator to make a constriction
is deleted, everything about that constriction is 
deleted, including the constriction degree. 

For example, Thráinsson (1978, cited in
Clements & Keyser, 1983) demonstrates that in
Icelandic, the productive phenomenon whereby
the first of a sequence of two identical voiceless
aspirated stops is replaced by [h], consists of
deleting the first set of supralaryngeal features.



This set of supralaryngeal features corresponds to 
the unit of an oral gesture; it includes the 
constriction degree, whether described as [clo], 
[stop] or [-continuant]. Thus, in this example, the 
entire oral gesture is deleted. 

Another example is cited in McCarthy (in press), 
using data from Straight (1976) and Lombardi 
(1987). In homorganic stop clusters and affricates
(with an intervening word boundary) in Yucatec
Maya, the oral portion of the initial stop is
deleted, for example, /k#k/ → /h#k/, and
(affricate) /ts#t/ → /s#t/. Again in this case the
constriction degree is deleted along with the place, 
supporting a gestural analysis: the entire oral 
closure gesture is deleted. Note that McCarthy 
effectively supports this analysis, in spite of his 
positing [continuant] as dependent on the Root 
node, when he says 'a stop becomes a segment
with no value for [continuant], which is 
incompatible with supraglottal articulation.' 
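
The point of these examples can be sketched in code. If constriction degree is stored inside the gesture itself, deleting the oral gesture necessarily removes its CD as well. The `Gesture` record, the `debuccalize` helper, and the gestural make-up given for the stop are hypothetical choices made for this illustration, not an analysis taken from the sources cited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gesture:
    """One gesture: an articulator set plus its constriction descriptors."""
    articulator: str   # e.g. "TB", "TT", "LIPS", "GLO", "VEL"
    cd: str            # constriction degree, e.g. "clo", "crit", "wide"
    cl: str = ""       # constriction location, e.g. "velar"


ORAL = {"LIPS", "TT", "TB"}


def debuccalize(segment):
    """Delete every oral gesture in `segment`, descriptors and all."""
    return [g for g in segment if g.articulator not in ORAL]


# Hypothetical gestural make-up of a voiceless aspirated velar stop.
k_stop = [Gesture("TB", "clo", "velar"), Gesture("GLO", "wide")]
print(debuccalize(k_stop))
```

What survives is only the glottal gesture; no [clo] value is left behind, because in this encoding there is no place outside the deleted gesture where it could be stored.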

Thus, there is some clear evidence in 
phonological patterning for the association of CD
with the articulator node. This aspect of gestural 
organization is totally compatible with the basic 
approach of feature geometry, requiring only that 
CD be linked to the articulator node rather than 
to a higher level node. In § 3, we will show how 
this gestural affiliation of CD is also compatible 
with phonological examples in which CD is 
seemingly separable from the gesture, where it is
'left behind' when a particular articulator-CL 
combination is deleted, or where it appears not to 
spread along with the articulatory set and CL. For 
now, we turn to a second aspect of the gestural 
approach that could usefully be incorporated into 
feature geometry, this one involving the use of the 
gestural score as a two-dimensional projection of 
the inherently three-dimensional phonological
representation. 

2.2.2 Gestural scores, articulatory
geometry, and phonological units

The gestural score, in particular the 'point' form 
in Figure 4, is topologically similar to a nonlinear 
phonological representation. When combined with 
the hierarchical articulatory geometry, the 
gestural score captures several relevant
dimensions of a nonlinear representation 
simultaneously, in a clear and revealing fashion. 
Figure 10 shows a gestural score for palm, using 
point notation and association lines, combined 
with the articulatory geometry on the left. Note 
that the geometry tree is represented on its side, 
rather than the more usual up-down orientation.

Figure 10. Gestural score for palm [pɑm] in point notation, with articulatory geometry.



This seemingly trivial point in fact is a very useful
consequence of using gestural scores as
phonological notation, as we shall see. It permits spatial
organization such as the articulatory geometry,
which is represented on the vertical axis, to be
separated from temporal information including
sequencing of phonological units, which is 
represented on the horizontal axis. Such a 
separation is particularly useful in those instances 
in which the gestures in a phonological unit are 
not simultaneous. 

For example, prenasalized stops are single 
phonological units consisting of a sequence of 
nasal specifications. Figure 11a depicts a gestural
score for prenasalized [mb], which contains a
closure gesture on the LIPS tier associated with a
sequence of two gestures on the VEL tier, one with
CD = [wide] followed by one with CD = [clo].
Double articulations also constitute a single
phonological unit. Figure 11b displays a gestural
score for [gb], which consists of a [clo labial]
gesture on the LIPS tier associated with a [clo 
velar] gesture on the TB tier. Sagey (1986) terms 
the first type of unit a contour segment, and the 
second type a complex segment, a terminology 
that is an apt description of the two figures. 
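
The two scores just described can be sketched as data. The dictionary layout, the tuple form of the association lines, and the `classify` helper are assumptions made for this illustration; the tiers and gesture labels follow the description of Figures 11a and 11b. Sequence within a tier encodes temporal order, while the tier names themselves carry the anatomical information.

```python
ORAL_TIERS = {"LIPS", "TT", "TB"}

# Each tier lists its gestures in temporal order; an association line
# links (tier, index) points on two different tiers.
prenasalized_mb = {
    "tiers": {
        "VEL": ["wide", "clo"],     # two gestures in sequence
        "LIPS": ["clo labial"],
    },
    "associations": [(("LIPS", 0), ("VEL", 0)), (("LIPS", 0), ("VEL", 1))],
}

labiovelar_gb = {
    "tiers": {
        "LIPS": ["clo labial"],
        "TB": ["clo velar"],
    },
    "associations": [(("LIPS", 0), ("TB", 0))],
}


def classify(score):
    """'contour' if some tier holds a sequence of gestures, 'complex' if more
    than one oral tier holds a gesture, otherwise 'simple'."""
    tiers = score["tiers"]
    if any(len(gestures) > 1 for gestures in tiers.values()):
        return "contour"
    oral = [name for name, gestures in tiers.items()
            if name in ORAL_TIERS and gestures]
    return "complex" if len(oral) > 1 else "simple"


print(classify(prenasalized_mb))  # contour
print(classify(labiovelar_gb))    # complex
```

In this encoding the contour/complex difference falls out of where the branching occurs: within a tier (sequence) or across tiers (simultaneity).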

Compare the representations in Figures 11a and 
11b to those in Figures 11c and 11d, which are
Sagey's (1986) feature geometry specifications of
the same two phonological units. In the feature
geometry representation, the two types of
segments appear to be equally sequential, or
equally non-sequential. This is a consequence— an 
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unfortunate consequence— of conflating two uses 
of branching notation, that of indicating (order- 
free) hierarchical information and that of
indicating sequencing information. Thus, in
Figure 11d (on the right), the branching lines
represent order-free branching in a hierarchical 
tree. In Figure 11c, however, the branching lines 
represent the associations between two ordered 
elements and a single node on another tier. This 
conflation results from the particular choice of 
how to project an inherently three-dimensional 
phonological structure (feature hierarchy x 
phonological unit constituency x time) onto two 
dimensions. 

In the gestural score, the two-dimensional 
projection avoids this conflation of the two uses of
branching notation. Here nodes always represent
gestures and lines are always association (or
phasing) lines. Thus any branching, as in Figure 
11a, always indicates temporal information about 
sequencing. The hierarchical information about 
the articulatory geometry is present in the 
organization of the articulatory tiers, and is thus 
represented (once) by a sideways tree all the way
at the left of the gestural score (as in Figure 10). 
In this way, the gestural score retains the virtues 
of earlier forms of phonological notation in which 
sequencing information was clearly 
distinguishable, while also providing the benefits 
of the spatial geometry that organizes the tiers 
hierarchically. It would, of course, be possible to 
adopt this kind of representation within feature 
geometry.
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Finally, the reader may have observed that, in 
discussions to now, constituency in phonological 
units has been indicated solely by using 
association lines between gestures. In particular, 
there has been no separate representation of 
prosodic phonological structure. This is not an 
inherent aspect of a phonology of articulatory
gestures; rather, it is the result of our current
research strategy, which is to see how much
structure inheres directly in the relations among 



gestures, without recourse to higher level nodes. 
However, once again, it is possible to integrate the 
gestural score with other types of phonological 
representation. To exemplify how the gestural
score can be integrated with explicit phonological
structure, Figure 12 displays a mapping between
the gestural score for palm and a simplified
version of the prosodic structure of Selkirk (1988),
with the articulatory geometry indicated on the
left.
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In short, although gestures and gestural scores 
originated in a description of articulator 
movement patterns, they nevertheless provide 
constructs that are useful for phonological 
representation. Gestures constitute an 
organization (combining a moving articulator set 
with the degree and location of the constriction 
formed) that can act as a phonological unit. 
Gestural scores provide a useful notation for 
nonlinear phonological representations, one that 
permits phonological constituency to be expressed 
in the same representation with 'phonetic' order 
information, which is indicated by the relative 
positions of the gestures in the temporal matrix. 
The gestures can be grouped into higher-level
units, using either association lines or a mapping
onto prosodic structure, regardless of the
simultaneity, sequentiality, or partial overlap of
the gestures. In addition, the articulatory 
geometry organizing the tiers can be expressed in 
the same representation. 

3 TUBE GEOMETRY AND 
CONSTRICTION DEGREE HIERARCHY 

So far, we have been treating individual 
gestures as the terminal nodes of an anatomical 
hierarchy. From this perspective, articulatory 
geometry serves as a way of creating natural 
classes of gestures on anatomical grounds. In this 
section, we turn to the vocal tract as a whole, and 
consider how additional natural classes emerge 
from the combined effect (articulatory, 
aerodynamic and acoustic consequences) of a set of
concurrent gestures. We argue that these natural
classes can best be characterized using a
hierarchy for constriction degree within the vocal
tract based on tube geometry.

Rather than viewing the vocal tract as a set of 
articulators, organized by the anatomy, tube 
geometry views it as a set of tubes, connected 
either in series or in parallel. The sets of 
articulators move within these tubes, creating 
constrictions. In other words, articulatory 
gestures occur within the individual tubes. More 
than one gesture may be simultaneously active; 
together, these gestures interact to determine the 
overall aerodynamic and acoustic output of the 
entire linked set of tubes. Similarities in the tube 
consequences of a set of different gestures may 
lead to similar phonological behavior: this is the 
source of acoustic features (Ladefoged, 1988a) 
such as [grave] and [flat] (Jakobson, Fant, & 
Halle, 1969) that organize different gestures (in 
our terms) according to standing wave nodes in 



the oral tube (Ohala, 1985; Ohala & Lorentz, 
1977). 

Within this tube perspective, there is a vocal 
tract hierarchy that characterizes an 
instantaneous time slice of the output of the vocal 
tract. Such a hierarchy may appear identical to 
the feature geometry of Sagey (1986). There are, 
however, two crucial differences. Unlike Sagey's 
root node, which 'corresponds neither to anatomy 
of the vocal tract nor to acoustic properties' (1986, 
p. 16), the highest node in the vocal tract
hierarchy characterizes the physical state of the
vocal tract at a single instant in time. In addition,
tube geometry characterizes manner — CD — at 
more than just the highest level node. CD is 
characterized at each level of the hierarchy. Thus, 
each of the tubes and sets of compound tubes in 
the hierarchy will have its own effective 
constriction degree that is completely predictable 
from the CD of its constituents, and hence 
ultimately from the CD of the currently active 
gestures. At the vocal tract level, the effective CD 
of the supralaryngeal tract, taken together with 
the initiator power provided by the lungs, 
determines the nature of the actual airflow 
through the vocal tract: none (complete occlusion),
turbulent flow, or laminar flow. This vocal tract 
hierarchy thus serves as the basis for a hierarchy 
for CD. 

The important point here for phonology is that, 
in the output system, CD exists simultaneously at
all the nodes in the vocal tract hierarchy— it is not 
isolable to any single node, but rather forms its 
own CD hierarchy. As we argue in § 3.2, all levels 
of the hierarchy are potentially important in 
accounting for phonological regularities, and may 
form the basis of a natural class. First, however, 
in § 3.1, we expand on the nature of tube 
geometry. 

3.1 How tube geometry works 

Figure 13 provides a pictorial representation of 
the tubes that constitute the vocal tract, 
embedded in the space defined by the anatomical 
hierarchy of articulatory geometry, in the vertical 
dimension, and by the constriction degree 
hierarchy of tube geometry, in the horizontal 
dimension. The constriction action of individual 
gestures is schematically represented by the small 
grey disks. Thus, the two dimensions in the figure
serve to organize the gestures occurring in the 
vocal tract tubes, vertically in terms of 
articulatory geometry, and horizontally in terms 
of tube geometry. Looking at the tube structure in 
the center, we can see that the vocal tract 
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branches into three distinct tubes or airflow 
channels: a Nasal tube, a Central tongue channel 
and a Lateral tongue channel. The complex is
terminated by GLO gestures at one end. At the
other end, the Central and Lateral channels are 
together terminated by LIPS gestures. 

The tube geometry tree, shown at the top of the 
figure, reflects the combinations of the tubes and 
their terminators. The tube level of the tree has 
five nodes, one for each of the three basic tubes 
and the two terminators. Within each of these 
basic tubes, the CD will be determined by the CD 
of the gestures acting within that tube, which are 
shown as subordinates to the tube nodes. For 
example, the CD of TT gestures will contribute to 



determining the CD of the Central tube. TB 
gestures will contribute to the Central and/or 
Lateral tube, depending on the constriction shape 
of the TB gesture. Each of the superordinate 
nodes corresponds to a tube junction, forming a 
compound tube from simpler tubes and/or the 
termination of a tube. Thus, the Central and 
Lateral tubes together form a compound tube 
labelled the Tongue tube. The compound Tongue 
tube is terminated by the LIPS configuration, 
which combines with the Tongue tube to form an
Oral tube. The Oral and Nasal tubes form another
compound tube, the Supralaryngeal, which is
terminated by GLO gestures to form the overall
Vocal Tract compound tube. 




Figure 13. Vocal tract hierarchy: Articulatory and tube geometry.

The effective CD at each superordinate node can
be predicted from the CD of the tubes being joined
and the way they are joined. When tubes are
joined in parallel, the effective CD of the
compound tube has the CD of the widest
component tube, that is, the maximum CD. When
they are joined in series, the compound tube has 
the CD of the narrowest component tube, that is, 
the minimum CD. Terminations and multiple 
constrictions within the same tube work like tubes 
connected in series. Using these principles, it is 
possible to 'percolate' CD values from the values 
for individual gestures up to the various nodes in 
the hierarchy. Table 1 shows the possible values 
for CD of individual gestures, and Table 2 shows 
how to determine the CD values at each 
successive node up through the Supralaryngeal 
node, referred to hereafter as the Supra node (the 
Vocal Tract node will be discussed below).

Table 1. Possible constriction degree (CD) values at
gestural tract variable level: open = (narrow or mid
or wide).

a. LIPS [CD]                  = clo  crit  open
b. TT [CD]                    = clo  crit  open
c. TB [CD, CS = normal]       = clo  crit  open
d. TB [CD, CS = narrowed]     = clo  crit  open
e. VEL [CD]                   = clo        open
f. GLO [CD]                   = clo  crit  open

GLO [crit] is the value appropriate for voicing.



Table 2. Percolation of CD up through Supralaryngeal
node.

a. Nasal   [CD] = VEL [CD]
   Lateral [CD] = TB [CD, CS = narrowed]
   Central [CD] = MIN (TT [CD], TB [CD, CS = normal])
b. Tongue  [CD] = MAX (Central [CD], Lateral [CD])
c. Oral    [CD] = MIN (Tongue [CD], LIPS [CD])
d. Supra   [CD] = MAX (Oral [CD], Nasal [CD])
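The percolation rules of Table 2 can be sketched computationally. The following Python fragment is an illustrative sketch only and is not part of the model as published: the function names (series, parallel, percolate), the string encoding of CD values, and the explicit ordering clo < crit < open are assumptions introduced here, as is the choice of [clo] for the narrowed-TB (Lateral) contribution in the non-lateral example.

```python
# Constriction degrees ordered from narrowest to widest (assumed encoding).
CD_ORDER = ["clo", "crit", "open"]

def series(*cds):
    """Tubes in series (and terminations): the narrowest CD wins (MIN)."""
    return min(cds, key=CD_ORDER.index)

def parallel(*cds):
    """Tubes in parallel: the widest CD wins (MAX)."""
    return max(cds, key=CD_ORDER.index)

def percolate(lips, tt, tb_normal, tb_narrowed, vel):
    """Percolate gestural CDs up to the Supra node, following Table 2."""
    nasal = vel                          # Table 2a
    lateral = tb_narrowed                # Table 2a
    central = series(tt, tb_normal)      # Table 2a
    tongue = parallel(central, lateral)  # Table 2b
    oral = series(tongue, lips)          # Table 2c
    supra = parallel(oral, nasal)        # Table 2d
    return {"Nasal": nasal, "Lateral": lateral, "Central": central,
            "Tongue": tongue, "Oral": oral, "Supra": supra}

# A nasal stop such as [n]: tongue tip closed, no lateral opening, velum open.
nasal_stop = percolate(lips="open", tt="clo", tb_normal="open",
                       tb_narrowed="clo", vel="open")
print(nasal_stop["Oral"], nasal_stop["Supra"])  # clo open
```

The Oral node comes out [clo] while the Supra node comes out [open], which is the configuration the text attributes to a nasal stop.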



The percolation principles follow from 
aerodynamic considerations. Basically, airflow 
through a tube system will follow the path of least
resistance, as we can illustrate with examples of 
nasals and laterals. The Oral and Nasal tubes are 
connected in parallel, forming a compound 
Supralaryngeal tube. If the Oral tube is [closed]
but the Nasal tube is [open], Table 2d indicates
that the CD of the combined Supra tube will be 
the wider of the two openings, i.e., [open], as it is 
in a nasal stop. To take a less obvious example, 
consider the case where the Nasal tube is [open], 
but the Oral tube is [crit], i.e., appropriate for 
turbulence generation under the appropriate 
airflow conditions. Table 2d predicts that in this 
case as well the Supra CD is [open], implying that 
there will be no turbulence generated. This in 
turn predicts that nasalized fricatives, which 
consist of exactly the VEL [open] and Oral [crit] 
gestures, should not exist. That is, given that, at 
normal airflow rates, the air in this configuration 
will tend to follow the path of least resistance 
through the [open] nasal passage, nasalized 
fricatives would require abnormal degrees of total 
airflow in order to generate airflow through the 
[crit] constriction that is sufficient to produce 
turbulence. 
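The nasalized-fricative prediction can be written out as a single application of Table 2d. The worked line below assumes the ordering [clo] < [crit] < [open], which is implicit in the text's use of "narrowest" and "widest" but is stated here only for illustration:

```latex
\[
\mathrm{Supra\,[CD]}
  = \max\bigl(\mathrm{Oral\,[CD]},\ \mathrm{Nasal\,[CD]}\bigr)
  = \max\bigl([\mathrm{crit}],\ [\mathrm{open}]\bigr)
  = [\mathrm{open}]
\]
```

Because the result is [open] rather than [crit], no turbulence is expected at the Supra level under normal airflow.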

Ohala (1975) has proposed that such nasalized 
fricatives are, indeed, rare, precisely because they
are hard to produce. As Ladefoged and Maddieson 
(1986) argue, this could account for alternations in 
which the nasalized counterparts of voiced 
fricatives are voiced approximants, such as in 
Guaraní (Gregores & Suárez, 1967, in Ladefoged
& Maddieson, 1986). However, they also present
evidence from Schadeberg (1982) for a nasalized 
labio-dental fricative in Umbundu. This suggests 
that the percolation principles may be overridden 
in certain special cases, perhaps by increased 
airflow settings. 

At the highest level, that of the Vocal Tract, the
glottal (GLO) CD and stiffness and the Supra CD
combine with initiator (pulmonic) action to
determine the actual aerodynamic and acoustic
characteristics of airflow through the vocal tract.
At this point, then, it becomes more appropriate to 
label the states of the Vocal Tract in terms of the 
characteristics of this 'output' airflow, rather than 
in terms of the CD, which is one of its determining 
parameters. A gross, but linguistically relevant, 
characterization of the output distinguishes the 
three states defined in Table 3a: occlusion, noise 
and resonance. Assuming some 'average' value for 
initiator power, Table 3b shows how GLO and
Supra CD jointly determine these properties. A 
complete closure of either system results in 
occlusion. If GLO is [crit] (appropriate position for 
voicing, assuming also appropriate stiffness), then 
it combines with an open Supralaryngeal tract to
produce resonance. Any other condition produces
noise. In some cases, e.g., GLO [open] and Supra
[open], this is weak noise generated at the glottis
(aspiration). In other cases, e.g., Supra [crit], the
noise is generated in the Oral tube (frication). 
Further details of the output, such as where the 
turbulence is generated, and whether voicing 
accompanies occlusion or noise, are beyond our 
scope here. Ohala (1983) gives many examples of 
how the principles involved are relevant to aspects 
of phonological patterning. 

Table 3. Acoustic consequences at Vocal Tract level.

a. VT outputs

   occlusion: no airflow through VT; silence or low-amplitude
              voicing
   noise:     turbulent airflow
   resonance: laminar airflow with voicing; formant structure

b. VT [CD] = occlusion / Supra [clo] OR GLO [clo]
           = resonance / Supra [open] AND GLO [crit]
           = noise     / otherwise
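Table 3b amounts to a three-way decision rule. A minimal sketch in Python follows, assuming (as the text does) some 'average' initiator power and appropriate glottal stiffness; the function name vt_output and the string labels are hypothetical conveniences, not notation from the model.

```python
def vt_output(supra, glo):
    """Map Supra and GLO constriction degrees to a Vocal Tract output (Table 3b)."""
    if supra == "clo" or glo == "clo":
        return "occlusion"   # complete closure of either system
    if supra == "open" and glo == "crit":
        return "resonance"   # voicing with an open supralaryngeal tract
    return "noise"           # every other combination

print(vt_output("open", "crit"))  # resonance
print(vt_output("open", "open"))  # noise (aspiration, generated at the glottis)
print(vt_output("crit", "open"))  # noise (frication, generated in the Oral tube)
print(vt_output("clo", "crit"))   # occlusion
```

The order of the tests matters: closure is checked first, so that Supra [clo] with GLO [crit] yields occlusion rather than falling through to noise.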



Finally, note that the percolation of CD through
levels of the tube geometry can be defined at any
instant in time, and depends on the actual size of
the constriction within each tube at that point in
time, regardless of whether some gesture is
actively producing the constriction. Thus, the 
instantaneous CD is the output consequence of the 
default values for articulators not under active 
control, the history of the articulator movements, 
and the constellation of gestures currently active. 
Table 4 shows the default value we are assuming 
for each basic tube and terminator. (The default 
for Lateral is seriously oversimplified, and does 
not give the right Lateral CD in the case of
Central fricatives, although percolation to the
Oral level will work correctly.)

Table 4. Default CD values for basic tubes and
terminators.

Nasal   [CD] = clo
Lateral [CD] = Central [CD]
Central [CD] = open
LIPS    [CD] = open
GLO     [CD] = crit

Default values are determined by the effect of the model
articulators' neutral configuration within a tube.
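The role of the defaults can be illustrated by filling in unspecified tubes before percolating. The sketch below is an assumption-laden illustration in Python: the names DEFAULTS, tube_state and supra_cd, and the dictionary encoding, are hypothetical, and only the tube-level part of the hierarchy is modelled.

```python
CD_ORDER = ["clo", "crit", "open"]  # assumed ordering, narrowest to widest

# Table 4 defaults; the Lateral default is filled in from Central below.
DEFAULTS = {"Nasal": "clo", "Central": "open", "LIPS": "open", "GLO": "crit"}

def tube_state(active):
    """Combine actively controlled tube/terminator CDs with Table 4 defaults."""
    state = {**DEFAULTS, **active}
    # By default, Lateral takes whatever value Central has.
    state.setdefault("Lateral", state["Central"])
    return state

def supra_cd(state):
    """Percolate tube-level CDs up to the Supra node (Table 2b-d)."""
    narrowest = lambda *c: min(c, key=CD_ORDER.index)
    widest = lambda *c: max(c, key=CD_ORDER.index)
    tongue = widest(state["Central"], state["Lateral"])
    oral = narrowest(tongue, state["LIPS"])
    return widest(oral, state["Nasal"])

# Only a bilabial closure is active; everything else takes its default.
state = tube_state({"LIPS": "clo"})
print(state["Lateral"], supra_cd(state))  # open clo
```

With only LIPS [clo] specified, the defaults supply a closed velum and an open tongue channel, so the closure percolates all the way to the Supra node.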



3.2 The constriction degree hierarchy 

The importance of tube geometry for phonology
resides in the fact that constriction degree is not 
isolable to any single level of the vocal tract. 
Rather, constriction degree exists at a number of 
levels simultaneously, with the CD at each level in 
this hierarchy defining a potential natural class. 
In this section, we present some examples in 
which CD from different levels is phonologically 
relevant, and argue that the CD hierarchy is 
important and clarifying for any featural system, 
as well as for gestural phonology. 

We are proposing (1) that there is a universal 
geometry for CD in which all nodes in the CD 
hierarchy are simultaneously present, and (2) that 
different CD nodes are used to represent different 
natural classes. This approach differs from that 
adopted in current feature geometries, in which 
manner features such as [nasal], [sonorant], 
[continuant] or [consonantal] are usually
represented at a single level in the feature 
hierarchy. Different geometries, however, choose 
different levels. For example, [nasal] has been 
variously considered to be a feature dependent on 
the highest (Root) node (McCarthy, 1988), on the
Supralaryngeal node (via a manner node)
(Clements, 1983), or on a Soft Palate node
(Ladefoged, 1988a,b; Ladefoged & Halle, 1988;
Sagey, 1986). It is possible that this variability in
the treatment of [nasal], and other features,
reflects the natural variation in the CD hierarchy, 
such that different phonological phenomena are 
captured using CD at different levels in the 
hierarchy. 

To see how this might work, consider the four 
hierarchies in Figure 14, which use the tube 
geometry at the top of Figure 13 to characterize 
the linked CD structures for a vowel, a lateral, a 
nasal, and an oral stop. Since tube geometry plays
the same role in the representation as the 
articulatory geometry in Figure 10, the tube 
hierarchies are rotated 90 degrees just as the 
anatomical hierarchy was. The circles are a 
graphic representation of the CD for each node, 
with the filled circles indicating CD = [clo], the 
wavy lines indicating CD = [crit], and the open 
circles indicating CD = [open] (the symbols 
indicate occlusion, noise, and resonance, 
respectively, at the Vocal Tract level). These CD 
hierarchies express the various classificatory 
(featural) similarities and dissimilarities among 
the four segments, but they do so at different 
levels. 

Figure 14. CD hierarchies for vowel, lateral, nasal and oral stop.



For example, the nasal has the same 
characterization as the vowel and lateral at the 
Supralaryngeal and Vocal Tract levels, but
diverges at lower levels. This aspect of the CD
hierarchy thus defines a phonological natural
class consisting of nasals, laterals, and vowels,
where the class is characterized by Supra [open]
and VT ['open' = resonance]. This natural class is
typically represented by the feature [sonorant],
although different feature systems select one or
the other of these levels in their definition of
sonority. For example, Ladefoged's (1988a,b)
acoustic feature distinction of [sonorant] utilizes
the identity of the value in the CD hierarchy at
the VT level. That is, for Ladefoged [+sonorant]
differs from [-sonorant] in terms of output at the
VT level. However, for Chomsky and Halle (1968) 
and Stevens (1972), it is the identity at the 
Supralaryngeal level that characterizes sonority, 
since they consider /h/ and /ʔ/ to be sonorants even
though they differ from the nasal, vowel and 
lateral at the VT level. 

At the Oral level (and below), the nasal and stop
display the same linked CD structure, which
differs from that of the lateral and that of the
vowel. That is, nasals and stops form a natural 
class in having Oral [clo], whereas laterals and 
vowels form a class in having Oral [open]. This 
difference in CD resides at a lower level of the 
hierarchy than the Vocal Tract or Supralaryngeal
levels. And at the lowest levels, the Tube
and Gestural, the nasal differs from the stop, 
lateral, and vowel in having both a Nasal [CD] 
and VEL [CD] that are [open]. Thus, constriction 
degree at various levels can serve to categorize 
and distinguish phonological units in different 
ways. 
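The way a single hierarchy yields different natural classes at different levels can be sketched by grouping the four segment types by their CD at each node. In the Python fragment below, the tube-level values assigned to each segment are illustrative assumptions chosen to be consistent with the classes described in the text; the names SEGMENTS, levels and classes_at are hypothetical.

```python
from collections import defaultdict

CD_ORDER = ["clo", "crit", "open"]  # assumed ordering, narrowest to widest
narrowest = lambda *c: min(c, key=CD_ORDER.index)
widest = lambda *c: max(c, key=CD_ORDER.index)

# Tube-level CDs as (Central, Lateral, LIPS, Nasal): illustrative assumptions.
SEGMENTS = {
    "vowel":     ("open", "open", "open", "clo"),
    "lateral":   ("clo",  "open", "open", "clo"),
    "nasal":     ("clo",  "clo",  "open", "open"),
    "oral stop": ("clo",  "clo",  "open", "clo"),
}

def levels(central, lateral, lips, nasal):
    """Percolate tube-level CDs to the Oral and Supra nodes (Table 2)."""
    oral = narrowest(widest(central, lateral), lips)
    return {"Nasal": nasal, "Oral": oral, "Supra": widest(oral, nasal)}

def classes_at(level):
    """Group segments by their CD value at one level of the hierarchy."""
    groups = defaultdict(list)
    for name, tubes in SEGMENTS.items():
        groups[levels(*tubes)[level]].append(name)
    return dict(groups)

for level in ("Supra", "Oral", "Nasal"):
    print(level, classes_at(level))
```

Under these assumed values, grouping at the Supra node puts vowel, lateral and nasal together against the oral stop, grouping at the Oral node pairs nasal with oral stop, and grouping at the Nasal node isolates the nasal.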

Within traditional feature systems, each of the 
natural classes described above receives a 
separate name, which obscures the systematic 
relation among the various classes. Moreover, 
even in feature geometry, when manner features 
are restricted to a single level there is no
principled representation of the hierarchical 
relation among the natural classes. However, 
feature geometries sometimes incorporate pieces 
of the CD hierarchy, for example, in the 
assignment of [nasal] and [continuant] as 
dependents of two different nodes in the hierarchy 
(Soft Palate and Root: Sagey, 1986), or the 
assignment of [sonorant] as part of the Root node
itself, but [continuant] as a dependent on the Root 
node (McCarthy, 1988). Note, however, that 
hierarchical relations among manner classes have 
to be stipulated in feature geometries, whereas 
such relations are inherent in the CD hierarchy. 
In addition, the percolation principles of tube 
geometry provide a mechanism for relating values 
of the different levels to each other, again 
something that would need to be stipulated in a 
hierarchy not based on tube geometry. 

All the levels of the CD hierarchy appear to be 
useful for establishing natural classes, and for 
relating the CD natural classes to one another. In 
addition, various kinds of phonological patterns, 
such as phonological alternations, can be 
examined in light of this hierarchy. In particular, 
we can investigate whether regularities are best 
expressed as processes that treat CD as tied to a
particular gesture (as an 'input' parameter) or as a 
consequence of gestural combination at the 
various 'output' levels of the CD hierarchy. For 
example, the cases discussed in § 2.2.1 (e.g., /k#k/
-> /h#k/) can be best described as processes that
delete entire gestures. CD is tied to the gesture
and is deleted along with the articulator set and
CL. This deletion automatically accounts (by
percolation) for the CD changes at other levels of
the hierarchy.

In other cases, however, there is much 
overdetermination — the phonological behavior can 
be described equally well from more than one 
perspective. For example, McCarthy (1988) 
discusses the common instances in which /s/ -> /h/, 
and glottalized consonants /p' t' k'/ -> /ʔ/. Both of
these examples can be straightforwardly analyzed
as deletion of the oral gesture, as in the cases in
§ 2.2.1. But it is also true that, in both cases, the
Vocal Tract output CD is unchanged by the 
gestural deletion. In the first case, it remains 
noisy, and in the second it remains an occlusion. 
Thus, the description of the phonological behavior 
could also focus on the (apparent) relative 
independence of the VT [CD] and the articulator 
set involved — the articulator set(s) changes, but 
CD does not. The equivalence of the two 
perspectives results from the fact that the
percolated values of VT [CD] are the same for the
two gestures alone and for their combination.
Thus, deletion of one or the other will not change 
the VT [CD]. Examples of this kind, of which there 
are likely to be many, can be viewed from a dual 
perspective. 
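The invariance of the Vocal Tract output under deletion of the oral gesture can be checked with a small sketch. The Python fragment below assumes a closed velum (so that the Supra CD equals the Oral CD), GLO [open] for /s/ and /h/, and GLO [clo] for the glottalized stops; these gestural specifications, and the function name vt_output, are assumptions made for illustration.

```python
CD_ORDER = ["clo", "crit", "open"]  # assumed ordering, narrowest to widest

def vt_output(oral_gestures, glo):
    """VT output for oral gestures combined in series, velum closed."""
    # With no oral gesture the oral tract takes its default, [open].
    supra = min(oral_gestures + ["open"], key=CD_ORDER.index)
    if supra == "clo" or glo == "clo":
        return "occlusion"
    if supra == "open" and glo == "crit":
        return "resonance"
    return "noise"

# /s/ -> /h/: TT [crit] plus GLO [open]; then the TT gesture is deleted.
print(vt_output(["crit"], "open"), vt_output([], "open"))  # noise noise
# /p'/ -> glottal stop: LIPS [clo] plus GLO [clo]; then LIPS is deleted.
print(vt_output(["clo"], "clo"), vt_output([], "clo"))     # occlusion occlusion
```

In both pairs the output category is the same before and after deletion, which is why the two descriptive perspectives coincide for these cases.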

One example of a process for which it seems (at
first glance) that a dual perspective cannot be 
maintained involves assimilation. Steriade (in 
press) has argued that assimilation is problematic 
for the gestural approach, since sometimes only 
the place, and not the manner, features are
assimilated. For example, in Kpelle nasals
assimilate in place but not manner to a following
stop or fricative, so that /N-f/ becomes a sequence
(broadly) transcribed as [mv] (Sagey, 1986). As
Steriade points out, this separation of the place
and manner features appears to argue against a
gestural analysis, in which the assimilation
results from increased overlap between the oral
gesture for the fricative (LIPS [crit den]) and the
velum lowering gesture (VEL [wide]). Since the
nasal does not become a nasal fricative, it would 
appear either that there is no increased overlap, 
or that the LIPS [crit] gesture changes its CD 
when it overlaps the velum lowering gesture. 
From the perspective of the CD hierarchy, the 
Supra CD of the nasal is the same before and after 
the assimilation ([open]), and therefore the Supra
level (or VT level) is the significant one for CD, 
rather than the Gestural level. However, the 
percolation principles from tube geometry predict 
that the Supra CD will be [open] even if the nasal 
is overlapped by a fricative gesture, as derived in 
(10). 

(10)        Supra [open]
           /            \
   LIPS [crit]      VEL [wide]

There could also be a TT gesture in the derivation 
that is either hidden or deleted — the Supra CD 
would still be [open]. Thus, an increase in gestural 
overlap could produce both the place assimilation
as well as the correct CD at the Supra level, 
without any change of the Gestural CD. (An 
articulatory study is needed to determine whether
the Gestural CD does indeed change.)

The examples presented so far are processes 
that are either best described with CD attached 
firmly to a given gesture as it undergoes change 
(e.g., deletion or sliding), or are at least equally 
well described that way. However, there are 
phenomena whose description requires a
loosening of the relation between CD and the
gesture. One such situation, in which the Gestural 
CD is effectively separated from its articulator set, 
occurs in historical change (Browman & 
Goldstein, in press), in particular in the types of 
historical change that Ohala (1981) terms 
'listener-based' sound changes. Such cases arise
particularly when gestural overlap leads to 
ambiguity in the acoustic signal that the listener 
can parse into one of two gestural patterns. In 
such cases, Browman and Goldstein (in press) 
argue that there is a (historical) 'reassignment' of 
the constriction degree parameter between 
overlapping gestures. This analysis was used to 
account for the /x/ -> /f/ changes in English words
like cough, based on an argument originally put 
forth by Pagliuca (1982). Briefly, the analysis 
proposed that the [crit] descriptor for the TB 
gesture for /x/ was re-assigned to an overlapping 
LIPS gesture.

A related synchronic situation, involving two 
overlapping gestures in a complex segment, is 
discussed in Sagey (1986). Her analysis suggests 
that manner features must be represented on 
more than a single node in a feature hierarchy.
Specifically, Sagey demonstrates that multiple 
oral gestures cohering in a complex segment may 
be restricted to a single distinctive constriction 
degree. She represents this single distinctive
constriction degree at the highest level in the 
hierarchy, but still must represent which 
particular articulation bears the contrastive CD. 
Thus, CD is specified at two different levels. Sagey 
diagrams the relation between the two levels by a 
looping arrow drawn between the distinctive level 
and the 'major' articulator making the distinctive
constriction. In the CD hierarchy, the Oral CD 
would be the lowest possible distinctive level for 
these double articulations. The gesture that 
carries the distinctive CD could be marked as 
[head], and would automatically agree in CD with 
the mother node, as in (11). 



(11)          Oral [αCD]
             /          \
   Gesture [αCD]      Gesture
   [head]



Moreover, in examples such as this, the 
percolation principles do not contribute to 
determining the CD of the distinctive node, except
in a negative way to be discussed shortly. Sagey 
argues that the distinctive CD cannot be predicted 
from physical principles, since in Margi labio- 
coronals, the coronal articulation is major, and
hence in /ps/ the less radical constriction is the
distinctive constriction, rather than the more 
radical one. Thus, the relation between the Oral 
and Gestural levels in (11) must be a statement 
about a functional phonological unit, a gestural 
constellation, rather than a statement 
characterizing an instantaneous time slice in
terms of tube geometry. That is, only one of the 
gestures in the constellation may bear a 
distinctive CD. Nevertheless, the importance of 
physical overlap relations can also be seen in this 
example. Maddieson and Ladefoged (1989) have 
shown that complex segments such as [gb] are not
completely synchronous, apparently lending 
support to the distinction between phonetic 
ordering, on the one hand, and phonological 
unordering, on the other hand. However, we argue 
that the ordering of gestures within a complex
segment is phonologically important in exactly the
case of a phonological unit like Margi /ps/ where
the distinctive value of CD at higher levels is not 
predicted from the lower level CDs by the 
percolation principles. If two gestures overlap 
completely (i.e., are precisely coextensive), and 
there are no additional phonetic cues of the sort 
discussed in Maddieson & Ladefoged, then the 
percolation principles will determine the higher 
level constriction degree throughout the entire 
time-course of the phonological unit, and the 
distinctive CD will fail to be conveyed. For 
example, in the case of /ps/, if the two gestures
were precisely aligned, the (distinctive) frication 
would never appear at the VT level. Only if the 
gestures are slightly offset can the distinctive CD 
be communicated. 
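The timing argument can be made concrete by percolating CD slice by slice. The following Python sketch assumes a LIPS [clo] gesture and a TT [crit] gesture, as for Margi /ps/, and treats each gesture's tube as [open] outside its activation interval; the step indices and the function name oral_cd_over_time are hypothetical.

```python
CD_ORDER = ["clo", "crit", "open"]  # assumed ordering, narrowest to widest

def oral_cd_over_time(lips_span, tt_span, n_steps):
    """Oral CD per time step for a LIPS [clo] and a TT [crit] gesture.

    Each span is a (start, end) pair of step indices, end exclusive.
    Outside its span a gesture's tube is taken to be [open].
    """
    track = []
    for t in range(n_steps):
        lips = "clo" if lips_span[0] <= t < lips_span[1] else "open"
        tt = "crit" if tt_span[0] <= t < tt_span[1] else "open"
        # LIPS terminates the Tongue tube, so the two combine in series (MIN).
        track.append(min(lips, tt, key=CD_ORDER.index))
    return track

coextensive = oral_cd_over_time((1, 4), (1, 4), 6)
offset = oral_cd_over_time((1, 4), (2, 5), 6)
print(coextensive)  # ['open', 'clo', 'clo', 'clo', 'open', 'open']
print(offset)       # ['open', 'clo', 'clo', 'clo', 'crit', 'open']
```

When the two gestures are coextensive, [crit] never surfaces at the Oral level; with a one-step offset, a slice of [crit] appears after the labial closure is released.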

In general, the CD hierarchy affords a structure 
within which a typology of phonological processes 
can be developed, based on the CD level that 
seems most relevant. It is an interesting research
challenge to develop this typology, and to ask how 
it is related to other ways of categorizing the 
processes. For example, are there systematic 
differences in relevant CD level that correlate
with whether the process involves spreading 
(sliding) rules or deletion rules, or with whether 
the process is prelexical or postlexical? 

4 SUMMARY 

We have argued that dynamically-defined 
articulatory gestures are the appropriate units to 
serve as the atoms of phonological representation. 
Gestures are a natural unit, not only because they 
involve task-oriented movements of the 
articulators, but because they arguably emerge as 
prelinguistic discrete units of action in infants.
The use of gestures, rather than constellations of 
gestures as in Root nodes, as basic units of 
description makes it possible to characterize a 
variety of language patterns in which gestural 
organization varies. Such patterns range from the 
misorderings of disordered speech through
phonological rules involving gestural overlap and
deletion to historical changes in which the overlap
of gestures provides a crucial explanatory
element.

Gestures can participate in language patterns 
involving overlap because they are spatiotemporal
in nature and therefore have internal duration. In
addition, gestures differ from current theories of 
feature geometry by including the constriction 
degree as an inherent part of the gesture. Since 
the gestural constrictions occur in the vocal tract, 
which can be characterized in terms of tube 
geometry, all the levels of the vocal tract will be 
constricted, leading to a constriction degree 
hierarchy. The values of the constriction degree at 
each higher level node in the hierarchy can be 
predicted on the basis of the percolation principles
and tube geometry. In this way, the use of
gestures as atoms can be reconciled with the use
of constriction degree at various levels in the vocal 
tract (or feature geometry) hierarchy. 

The phonological notation developed for the 
gestural approach might usefully be incorporated, 
in whole or in part, into other phonologies. Five
components of the notation were discussed, all 
derived from the basic premise that gestures are 
the primitive phonological units organized into
gestural scores. These components include (1) 
constriction degree as a subordinate of the 
articulator node and (2) stiffness (duration) as a
subordinate of the articulator node. That is, both 
CD and duration are inherent to the gesture. The 
gestures are arranged in gestural scores using (3) 
articulatory tiers, with (4) the relevant geometry 
(articulatory, tube or feature) indicated to the left
of the score and (5) structural information above 
the score, if desired. Association lines can also be 
used to indicate how the gestures are combined 
into phonological units. Thus, gestures can serve 
both as characterizations of articulatory 
movement data and as the atoms of phonological 
representation. 

FOOTNOTES

* Phonology, 6, 201-251 (1989).

† Also, Department of Linguistics, Yale University.

1 A simple dynamical system consists of a mass attached to the
end of a spring (a damped mass-spring model). If the mass is
pulled, stretching the spring beyond its rest length (equilibrium
position), and then released, the system will begin to oscillate.
The resultant movement patterns of the mass will be a damped
sinusoid described by the solution to the equation below. When
such an equation is used to model the movements of coordinated
sets of articulators, the 'object' (the motion variable) in the
equation is considered to be the tract variable; for example, lip
aperture (LA). Thus, the sinusoidal trajectory would describe
how lip aperture changes over time:

$$ m\ddot{x} + b\dot{x} + k(x - x_0) = 0 $$

where $m$ = mass of the object, $b$ = damping of the system, $k$ =
stiffness of the spring, $x_0$ = rest length of the spring (equilibrium
position), $x$ = instantaneous displacement of the object, $\dot{x}$ =
instantaneous velocity of the object, $\ddot{x}$ = instantaneous
acceleration of the object.
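The behavior described in this footnote can be illustrated numerically. The Python sketch below assumes the standard damped mass-spring form, m*x'' + b*x' + k*(x - x0) = 0, with the damping coefficient written b; the parameter values, the step size, and the function name simulate are arbitrary choices made for illustration and are not taken from the model.

```python
def simulate(m=1.0, b=0.5, k=4.0, x0=0.0, x_init=1.0, dt=0.001, t_end=20.0):
    """Integrate m*x'' + b*x' + k*(x - x0) = 0 with semi-implicit Euler."""
    x, v = x_init, 0.0          # released from rest at x_init
    xs = [x]
    for _ in range(int(t_end / dt)):
        a = -(b * v + k * (x - x0)) / m   # acceleration from the equation
        v += a * dt
        x += v * dt
        xs.append(x)
    return xs

xs = simulate()
# Underdamped when b**2 < 4*m*k: the trajectory overshoots x0, then settles.
print(min(xs) < 0.0, abs(xs[-1]) < 0.05)  # True True
```

With these values the system is underdamped, so the tract variable oscillates about the equilibrium position with decaying amplitude, which is the damped sinusoid the footnote refers to.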

2 For ease of reference to the gesture, it is possible to use either a
bundle of gestural descriptors (or a selected subset) or gestural
symbols. We have been trying different approaches to the
question of what gestural symbols should be; our present best
estimate is that gestures should be treated like archiphonemes.
Thus our current proposal is to use the capitalized form of the
voiced IPA symbol for oral gestures, capitalized and diacritized
{H} for glottal gestures, and {±N} for [velic clo] and [velic open]
gestures, respectively. In order to clearly distinguish gestural
symbols from other phonetic symbols, we enclose them in curly
brackets: { }. This approach should permit gestural descriptions
to draw upon the full symbol resources of IPA, rather than
attempting to develop an additional set of symbols. However,
we welcome comments from others on this decision, particularly
on the proposal to capitalize gestural symbols, a choice that
minimizes confusion with phonemic transcriptions but that
leads to a conflict with uvular symbols in the current IPA
system.

The Perception of Phonetic Gestures*



Carol A. Fowler† and Lawrence D. Rosenblum††



We have titled our presentation "The perception of phonetic gestures" as if phonetic
gestures are perceived. By phonetic gestures we refer to organized movements of one or
more vocal-tract structures that realize phonetic dimensions of an utterance (cf. Browman
& Goldstein, 1986; in press a). An example of a gesture is bilabial closure for a stop, which
includes contributions by the jaw and the upper and lower lips. Gestures are organized
into larger segmental and suprasegmental groupings, and we do not intend to imply that
these larger organizations are not perceived as well. We focus on gestures to emphasize a
claim that, in speech, perceptual objects are fundamentally articulatory as well as
linguistic.

That is, in speech perception, articulatory events have a status quite different from that of
their acoustic products. The former are perceived, whereas the latter are the means (or one 
of the means) by which they are perceived. 

A claim that phonetic gestures are perceived is not uncontroversial, of course, and there 
are other points of view (e.g., Massaro, 1987; Stevens & Blumstein, 1981). We do not
intend to consider these other views here, however, but instead to focus on agreements and
disagreements between two theoretical perspectives from which the claim is made.
Accordingly, we begin by summarizing some of the evidence that, in our view, justifies it.



Phonetic gestures are perceived:
Three sources of evidence

1. Correspondence failures between acoustic 
signal and percept: Correspondences 
between gestures and percept 

Perhaps the most compelling evidence that 
gestures, and not their acoustic products, are 
perceptual objects is the failure of dimensions of 
speech percepts to correspond to obvious 
dimensions of the acoustic signal and their
correspondence, instead, to phonetically-organized
articulatory behaviors that produce the signal. We
offer three examples, all of them implicating 
articulatory gestures as perceptual objects and the 
third showing most clearly that the perceived 
gestures are not surface articulatory movements,
but rather, linguistically-organized gestures. 

a. Synthetic /di/ and /du/ 

One example from the early work at Haskins 
Laboratories (Liberman, Cooper, Shankweiler, & 
Studdert-Kennedy, 1967) is of synthetic /di/ and 
/du/. Monosyllables, such as those in Figure 1, can 
be synthesized that consist only of two formants. 
The information specifying /d/ (rather than /b/ or 
/g/) in both syllables is the second formant 
transition. These transitions are very different in 
the two syllables, and, extracted from their 
syllables, they sound very different too. Each 
sounds more-or-less like the frequency glide it 
resembles in the visible display. Neither sounds 
like /d/. In the context of their respective syllables,
however, they sound alike and they sound like /d/.

The consonantal segments in /di/ and /du/ are 
produced alike too, by a constriction and release 
gesture of the tongue tip against the alveolar ridge 
of the palate. When listeners perceive the 
synthetic /di/ and /du/ syllables of Figure 1, their
percepts correspond to the implied constriction 
and release gestures, not, it seems, to the context- 
sensitive acoustic signal. 



Figure 1. Synthetic syllables /di/ and /du/. The second
formant transitions identify the initial consonant as /d/
rather than as /b/ or /g/.

b. Functional equivalence of acoustic
"cues"

We expect listeners to be very good at 
distinguishing an interval of silence from 
nonsilence— from a set of frequency glides, for 
example. Too, we expect them to distinguish 
acoustic signals that differ in two ways more 
readily than signals that differ in just one of the 
two ways. Both of these expectations are vio- 
lated — another example of noncorrespondence — if 
the silence and glides are joint acoustic products 
of a common constriction and release gesture for a 
stop consonant.

Fitch, Halwes, Erickson and Liberman (1980) 
created synthetic syllables identified as "slit" or
"split" by varying the duration of a silent interval
following the fricative and manipulating the presence
or absence of transitions for a bilabial stop
following the silent interval. A relatively long silent
interval and the presence of transitions both
signal a bilabial stop, the silent interval cuing the 
closure and the transitions the release. Fitch et al. 
found that pairs of syllables differing on both cue 
dimensions, duration of silence and pres- 
ence/absence of transitions, were either more 
discriminable than pairs differing in one of these
ways or less discriminable depending on how the 
cues were combined. A syllable with a long silent 
interval and transitions was highly discriminable 
from a syllable with a shorter silent interval and 
no transitions; the one was identified as "split" 
and the other as "slit." A syllable with a short
silent interval and transitions was nearly
indiscriminable from one with a longer interval
and no transitions; both were identified as "split."
Syllables differing in two ways are indiscrim- 
inable just when the acoustic cues that distinguish 
them are "functionally equivalent"— that is, they 
cue the same articulatory gesture. A long silent 
interval does not normally sound like a set of 
frequency glides, but it does in a context in which 
each specifies a consonantal constriction. 

c. Perception of intonation

The findings just summarized, among others,
reveal that listeners perceive gestures.
Apparently, listeners do not perceive the acoustic
signal per se.

Nor, however, do they perceive "raw" 
articulatory motions as such. Rather, they 
perceive linguistically-organized (phonetic) 
gestures. Research on the various ways in which 
fundamental frequency (henceforth, f0) is
perceived shows this most clearly. 

Perceived intonational peak height will not, in 
general, correspond to the absolute rate at which 
the vocal folds open and close during production of 
the peak. Instead, perception of the peak
corresponds to just those influences on the rate of 
opening and closing that are caused by gestures 
intended by the talker to affect intonational peak 
height. (Largely, intonational melody is 
implemented by contraction and relaxation of 
muscles of the larynx that tense the vocal folds; 
see, e.g., Ohala, 1978.) There are other influences 
on the rate of vocal fold opening and closing that 
may either decrease or increase f0. Some of these
influences, due to lung deflation during an
expiration ("declination," Gelfer, Harris, & Baer,
1987; Gelfer, Harris, Collier, & Baer, 1986) or to
segmental perturbations reflecting vowel height 
(e.g., Lehiste & Peterson, 1961) and obstruent 
voicing (e.g., Ohde, 1984), are largely or entirely 
automatic consequences of other things that 
talkers are doing (producing an utterance on an
expiratory airflow, producing a close or open vowel
[Honda, 1981], producing a voiced or voiceless
obstruent [Ohde, 1984]; Löfqvist, Baer, McGarr, &
Seider Story, in press). They do not sound like 
changes in pitch; rather, they sound like what 
they are: information for early-to-late serial 
position in an utterance in the case of declination 
(Pierrehumbert, 1979; see also Lehiste, 1982), and 
information for vowel height (Reinholt Peterson,
1986; Silverman, 1987) or consonant voicing (e.g.,
Silverman, 1986) in the case of segmental
perturbations.

As we will suggest below (under "How acoustic
structure may serve as information for speech
perceivers"), listeners apparently use configurations
of changes in different acoustic variables
to recover the distinct, organized articulatory
systems that implement the various linguistic
dimensions of talkers' utterances. By using
acoustic information in this way, listeners can
recover what Liberman (1982) has called the
talker's "phonetic intents."
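
The idea that several gestures have converging effects on fundamental frequency, and that only one of them determines perceived peak height, can be made concrete with a small numerical sketch. It is an illustration of the claim rather than a model proposed here. It assumes that declination, segmental perturbation, and the intonational gesture contribute additively, and every function name and value (baseline, slope, shifts in Hz) is hypothetical.

```python
def declination(t_s, start_hz=120.0, slope_hz_per_s=-10.0):
    """Hypothetical baseline that falls linearly over the utterance."""
    return start_hz + slope_hz_per_s * t_s


def observed_f0(t_s, peak_hz, vowel_shift_hz):
    """Surface f0: baseline + segmental perturbation + intonational peak."""
    return declination(t_s) + vowel_shift_hz + peak_hz


def recovered_peak(f0_hz, t_s, vowel_shift_hz):
    """Peak height left over once the other influences are attributed to their sources."""
    return f0_hz - declination(t_s) - vowel_shift_hz


# Two peaks with the same intended height (30 Hz): one early on an open
# vowel (shift of -4 Hz), one late on a close vowel (shift of +5 Hz).
early = observed_f0(0.5, peak_hz=30.0, vowel_shift_hz=-4.0)
late = observed_f0(1.5, peak_hz=30.0, vowel_shift_hz=5.0)

early_peak = recovered_peak(early, 0.5, vowel_shift_hz=-4.0)
late_peak = recovered_peak(late, 1.5, vowel_shift_hz=5.0)

print(early, late)            # surface values differ
print(early_peak, late_peak)  # recovered peak heights agree
```

The surface f0 values differ, yet the peak height attributed to the intonational gesture is the same in both positions. This corresponds to the observation that perceived peak height does not track the absolute rate of vocal fold vibration.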

2. Audio-visual integration of gestural
information

A video display of a face mouthing /ga/ 
synchronized with an acoustic signal of the 
speaker saying /ba/ is heard most typically as "da"
(MacDonald & McGurk, 1978). Subjects' 
identifications of syllables presented in this type 
of experiment reflect an integration of information 
from the optical and acoustic sources. Too, as
Liberman (1982) points out, the integration affects 
what listeners experience hearing to an extent 
that they cannot tell what contribution to their 
perceptual experience is made by the acoustic 
signal and what by the video display.^ 

Why does integration occur? One answer is that 
both sources of information, the optical and the 
acoustic, provide information apparently about 
the same event of talking, and they do so by 
providing information about the talkers' phonetic 
gestures. 

3. Shadowing 

Listeners' latency to repeat a syllable they hear
is very short: in Porter's research (Porter, 1976;
Porter & Lubker, 1980), around 180 ms on 
average. Even though these latencies are obtained 
in a choice reaction time procedure (in which the 
vocal response required is different for different 
stimuli to respond), latencies approach simple 
reaction times (in which the same response occurs 
to any stimulus to respond), and they are much 
shorter than choice reaction times using a button
press. 

Why should these particular choice reaction 
times be so fast? Presumably, the compatibility 
between stimulus and response explains the fast 
response times. Indeed, it effectively eliminates
the element of choice. If listeners perceive the 
talker's phonetic gestures, then the only response 
requiring essentially no choice at all is one that 
reproduces those gestures. 

The motor theory 
Throughout most of its history, the motor theory 
(e.g., Liberman, Cooper, Harris, & MacNeilage,
1953; Liberman et al., 1967; Liberman &
Mattingly, 1985; see also Cooper, Delattre,
Liberman, Borst, & Gerstman, 1952) has been the
only theory of speech perception to identify the 
phonetic gesture as an object of perception. Here 
we describe the motor theory by discussing what, 
more precisely, the motor theorists have 
considered to be the object of perception, how they 
characterize the process of speech perception and 
why, recently, they have introduced the idea that 
speech perception is accomplished by a specialized 
module. 

What is perceived for the motor theorist? 

Coarticulation is the reason why the acoustic 
signal appears to correspond so badly to the 
sequences of phonemes that talkers intend to 
produce. Due to coarticulation, phonemes are 
produced in overlapping time frames so that the 
acoustic signal is everywhere (or nearly 
everywhere; see, e.g., Stevens & Blumstein, 1981),
context-sensitive. This makes the signal a complex 
"code" on the phonemes of the language, not a 
cipher, like an alphabet.2 In "Perception of the
speech code" (1967), Liberman and his colleagues
speculated that coarticulatory "encoding" is, in
part, a necessary consequence of properties of the 
speech articulators (their sluggishness, for 
example). However, in their view, coarticulation is 
also promoted both by the nature of phonemes 
themselves (that they are realized by sets of
subphonemic features^) and by the listener's
short-term memory, which would be overtaxed by 
the slow transmission rate of an acoustic cipher. 

In producing speech, talkers exploit the fact that 
the different articulators (the lips, velum, jaw,
etc.) can be independently controlled.
Subphonemic features, such as lip rounding, 
velum lowering and alveolar closure each use 
subsets of the articulators, often just one; 
therefore, more than one feature can be produced 
at a time. Speech can be produced at rapid rates 
by allowing "parallel transmission" of the
subphonemic features of different phonemes. This 
increases the transmission rates for listeners, but 
it also creates much of the encoding that is 
considered responsible for the apparent lack of 
invariance between acoustic and phonetic 
segments. 

The listener's percept corresponds, it seems, 
neither to the encoded cues in the acoustic signal 
nor even to the also-encoded succession of vocal 
tract shapes during speech production, but instead 
to a sequence of discrete, unencoded phonemes, 
each composed of its own component subphonemic
features. To explain why "perception mirrors
articulation more closely than sound" (p. 453) and 
(yet) achieves recovery of discrete unencoded 
phonemes, the motor theorists proposed as a first 
hypothesis that perceivers somehow access their 
speech-motor systems in perception and that the 
percept they achieve corresponds to a stage in 
production before encoding of the speech segments 
takes place. In "Perception of the speech code," the 
stage was one in which "motor commands" to the 
muscles were selected to implement subphonemic 
features. In "The motor theory revised," 
(Liberman & Mattingly, 1985), a revision to the 
theory reflects developments in our understanding 
of motor control. Evidence suggests that activities 
of the vocal tract are products of functional 
couplings among articulators (e.g., Folkins & 
Abbs, 1975, 1976; Kelso, Tuller, Vatikiotis-
Bateson, & Fowler, 1984), which produce gestures
as defined earlier, not independent movements of 
the articulators identified with subphonemic 
features in "Perception of the speech code." In
"The motor theory revised," control structures for
gestures have replaced motor commands for 
subphonemic features as invariants of production 
and as objects of perception for listeners.^ Like 
subphonemic features, control structures are
abstract, prevented by coarticulation from making 
public appearances in the vocal tract. Liberman 
and Mattingly write of the perceptual objects of 
the revised theory: 

We would argue, then, that the gestures do have
characteristic invariant properties, as the motor 
theory requires, though these must be seen, not as 
peripheral movements, but as the more remote 
structures that control the movements. These 
structures correspond to the speaker's intentions.
(p. 23)

In recovering abstract gestures, processes of 
speech perception yield quite different kinds of 
perceptual objects than general auditory 
perception. In auditory perception, more 
generally, according to Liberman and Mattingly, 
listeners hear the signal as "ordinary sound" 
(p. 6); that is, they hear the acoustic signal as 
such. In other publications, Mattingly and 
Liberman (1988) refer to this apparently more 
straightforward perceptual object as 
"homomorphic" in contrast to objects of speech 
perception which are "heteromorphic." An 
example they offer of homomorphic auditory 
perception is perception of isolated formant 
transitions which sound like the frequency glides 
they resemble in a spectrographic display. 



How perception takes place in the motor
theory

In the motor theory, listeners use "analysis by
synthesis" to recover phonetic gestures from the
encoded, informationally-impoverished acoustic 
signal. This aspect of the theory has never been 
worked out in detail. However, in general, 
analysis by synthesis consists in analysing a 
signal by guessing how the signal might have been 
produced (e.g., Stevens, 1960; Stevens & Halle, 
1964). Liberman and Mattingly refer to an
"internal, innately specified vocal-tract
synthesizer that incorporates complete
information about the anatomical and
physiological characteristics of the vocal tract and
also about the articulatory and acoustic
consequences of linguistically significant gestures"
(p. 26). The synthesizer computes candidate
gestures and then determines which of those
gestures, in combination with others identified as
ongoing in the vocal tract, could account for the
acoustic signal. 
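
Because this aspect of the theory has not been worked out in detail, the following is only a schematic sketch of the general analysis-by-synthesis idea: propose candidate gestures, predict the acoustic consequences of each, and keep the candidate whose prediction best matches the signal. The lookup table standing in for the synthesizer, the two-number acoustic description, and all values are invented for illustration and are not taken from Liberman and Mattingly.

```python
import math

# Hypothetical forward model: each candidate gesture is paired with the
# second- and third-formant transition slopes it is assumed to produce.
# The numbers are illustrative only, not measured values.
SYNTHESIZER = {
    "bilabial closure": (-300.0, -200.0),
    "alveolar closure": (100.0, -100.0),
    "velar closure": (400.0, 300.0),
}


def synthesize(gesture):
    """Predict the acoustic pattern that a candidate gesture would produce."""
    return SYNTHESIZER[gesture]


def mismatch(predicted, observed):
    """Euclidean distance between predicted and observed patterns."""
    return math.dist(predicted, observed)


def analyze_by_synthesis(observed):
    """Return the candidate gesture whose predicted acoustics fit best."""
    return min(SYNTHESIZER, key=lambda g: mismatch(synthesize(g), observed))


best_guess = analyze_by_synthesis((90.0, -120.0))
print(best_guess)
```

A fuller version would also have to account for gestures already under way in the vocal tract, as the passage above notes; the sketch evaluates each candidate in isolation.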

Speech perception as modular

If speech perception does involve accessing the 
speech-motor system, then it must indeed be 
special and quite distinct from general auditory
perception. It is special in its objects of perception, 
in the kinds of processes applied to the acoustic 
signal, and presumably in the neural systems
dedicated to those processes as well. Liberman 
and Mattingly propose that speech perception is 
achieved by a specialized module. 

A module (Fodor, 1983) is a cognitive system 
that tends to be narrowly specialized ("domain
specific"), using computations that are special 
("eccentric") to its domain; it is computationally 
autonomous (so that different systems do not 
compete for resources) and prototypically is 
associated with a distinct neural substrate. In 
addition, modules tend to be "informationally
encapsulated," bringing to bear on the processing 
they do only some of the relevant information the 
perceiver may have; in particular, processing of 
"input" (perceptual) systems— prime examples of 
modules — is protected early on from bias by "top- 
down" information. 

The speech perceptual system of the motor 
theory has all of these characteristics. It is 
narrowly specialized and its perception-production 
link is eccentric; moreover, it is associated with a 
specialized neural substrate (e.g., Kimura, 1961). 
In addition, as the remarkable phenomenon of 
duplex perception (e.g., Liberman, Isenberg, & 
Rakerd, 1981; Rand, 1974) suggests, the speech
perceiving system is autonomous and
informationally encapsulated. 

In duplex perception as it is typically 
investigated (e.g., Liberman et al., 1981; Mann & 
Liberman, 1983; Repp, Milburn, & Ashkenas,
1983), most of an acoustic CV syllable (the "base"
at the left of Figure 2) is presented to one ear
while the remainder, generally a formant 
transition (either of the "chirps" on the right side 
of Figure 2) is presented to the other ear. Heard in 
isolation, the base is ambiguous between "da" and 
"ga," but listeners generally report hearing "da." 
(It was identified as "da" 87% of the time in the 
study by Repp et al., 1983.) In isolation, the chirps 
sound like the frequency glides they resemble; 
they do not sound speech-like. Presented 
dichotically, listeners integrate the chirp and the 
base, hearing the integrated "da" or "ga" in the ear
receiving the base. Remarkably, in addition, they 
hear the chirp in the other ear. Researchers who 
have investigated duplex perception describe it as 
perception of the same part of an acoustic signal 
in two ways simultaneously. If that character- 
ization is correct, it implies strongly that the 
percepts are outputs of two distinct and 
autonomous perceptual systems, one specialized 
for speech and the other perhaps general to other 
acoustic signals. 



A striking characteristic of speech perceptual 
systems that integrate syllable fragments 
presented to different ears is their imperviousness
to information in the spatial separation of the 
fragments that they cannot possibly be part of the 
same spoken syllable — an instance, perhaps, of 
information encapsulation. 

In recent work, Mattingly and Liberman (in 
press) have revised, or expanded on, Fodor's view 
of modules by proposing a distinction between
"closed" and "open" modules. Closed modules,
including the speech module and a sound-
localization module, for example, are narrowly
specialized as Fodor has characterized modules
more generally. In addition (among other special 
properties), they yield heteromorphic percepts — 
that is, percepts whose dimensions are not those 
of the proximal stimulation. Although Mattingly
and Liberman characterize the heteromorphic
percept in this way (in terms of what it does not
conform to), it appears that the heteromorphic
percept can be characterized in a more positive 
way as well. The dimensions of heteromorphic 
percepts are those of distal events, not of proximal 
stimulation. The speech module renders phonetic
gestures; the sound-localization module renders 
location in space. By contrast, open modules are 
sort of "everything-else" perceptual systems. 



[Figure 2; panel label: Normal syllables]
Figure 2. Stimuli that yield duplex perception. The base is presented to one ear and the third formants to another. In
the ear to which the base is presented, listeners hear the syllable specified jointly by the base and the transitions; in
the other ear, they hear the transitions as frequency glides. (Figure adapted from Whalen & Liberman, 1987).

An open auditory-perceptual module is
responsible for perception of most sounds in the
environment. According to the theory, outputs of
open modules are homomorphic. 

In the context of this account of auditory 
perception, the conditions under which duplex 
perception is studied are seen as somehow 
tricking the open module into providing a percept 
of the isolated formant transition even though the 
transition is also being perceived by the speech 
module. Accordingly, two percepts are provided for 
one acoustic fragment; one percept is 
homomorphic and the other is heteromorphic. 

Prospectus 

Our brief overview of the motor theory obviously 
cannot do justice to it. In our view, it is, to date, 
superior to other theories of speech perception in 
at least two major respects: in its ability to handle
the full range of behavioral findings on speech 
perception (in particular, of course, the evidence
that listeners recover phonetic gestures) and in
having developed its account of speech in the 
context of a more general theory of biological 
specializations for perception. 

Our purpose here, however, is not just to praise 
the theory, but to challenge it as well, with the 
further aim of provoking the motor theorists 
either to buttress their theory where it appears to 
us vulnerable, or else to revise it further. 

We will raise three general questions from the 
perspective of our own, direct-realist theory 
(Fowler, 1986a,b; Rosenblum, 1987). First we 
question the inference from evidence that 
listeners recover phonetic gestures that the 
listener's own speech-motor system plays a role in
perception. The nature of the challenge we mount 
to this inference leads to a second one. We 
question the idea that, whereas a specialized
speech module (and other closed modules)
render heteromorphic percepts, other percepts are
homomorphic. Finally, we challenge the idea in 
any case that duplex perception reveals that 
speech perception is achieved by a closed module. 

Standing behind all of these specific questions 
we raise about claims of the motor theory is a 
general issue that needs to be confronted by all of
us who study speech perception and, for that 
matter, perception more generally. The issue is 
one of determining when behavioral data warrant 
inferences being drawn about perceptual 
processes taking place inside perceivers and when 
the data deserve accounting instead in terms of 
the nature of events taking place publicly when 
something is perceived.



Does perceptual recovery of phonetic 
gestures implicate the listener's speech 
motor system? 

In our view, the evidence that perceivers recover 
phonetic gestures in speech perception is 
incontrovertible^ and any theory of speech 
perception is inadequate unless it can provide a 
unified account of those findings. However, the 
motor theorists have drawn an inference from 
these findings that, we argue, is not warranted by 
the general observation that listeners recover 
gestures. The inference is that recovery of 
gestures implies access by the perceiver to his own 
speech-motor system. It is notable, perhaps, that, 
in neither "Perception of the speech code" nor "The
motor theory revised," do Liberman and his
colleagues offer any evidence in support of this 
claim except evidence that listeners recover 
gestures (and that human left-cerebral 
hemispheres are specialized for speech and 
especially for phonetic perception [e.g., Kimura, 
1961; Liberman, 1974; Studdert-Kennedy & 
Shankweiler, 1970]). 

There is another way to explain why listeners 
recover phonetic gestures. It is that phonetic 
gestures are among the "distal events" that occur
when speech is perceived and that perception 
universally involves recovery of distal events from 
information in proximal stimulation. 

Distal events universally are perceptual
objects: Proximal stimuli universally are
not.

Consider first visual perception observed from 
outside the perceiver.^ Visual perceivers recover 
properties of objects and events in their 
environment ("distal events"). They can do so, in
part, because the environment supplies 
information about the objects and events in a form 
that their perceptual systems can use. Light 
reflects from objects and events, which structure it 
lawfully; given a distal event and light from some 
source, the reflected light must have the structure 
that it has. To the extent that the structure in the
light is also specific to the properties of a distal 
event that caused it, it can serve as information to 
a perceiver about its distal source. The reflected 
light ("proximal stimulation") has another 
property that permits it its central role in 
perception. It can stimulate the visual system of a
perceiver and thereby impart its structure to it.
From there, the perceiver can use the structure as
information for distal-event perception.

The reflected light does not provide information
to the visual system by picturing the world.
Information in reflected light for "looming" (that
is, for an object on a collision course with the
perceiver's head), for example, is a certain manner
of expansion of the contours of the object's
reflection in the light, that progressively covers
the contours of optical reflections of immobile
parts of the perceiver's environment. When an
object looms, it does not grow; it approaches.
However, its optical reflection grows, and,
confronted with such an optic array, perceivers
(from fiddler crabs to kittens to rhesus monkeys to
humans [Schiff, 1965; Schiff, Caviness, & Gibson,
1962]) behave as if they perceive an object on a
collision course; that is, they try to avoid it.

Two related conclusions from this 
characterization of visual perception are first that 
observers see distal events based on information 
about them in proximal stimulation and second 
that, in Mattingly and Liberman's terms, visual 
perception therefore is quite generally 
heteromorphic. It is not merely heteromorphic in 
respect to those aspects of stimulation handled by 
closed modules (for example, one that recovers 
depth information from binocular disparity); it is 
generally the case that the dimensions of the 
percept correspond with dimensions of distal 
objects and events and not necessarily with those 
of a distal-event-free description of the proximal
stimulation.7 

Auditory perception is analogous to visual 
perception in its general character, viewed, once
again, from outside the perceiver. Consider any
sounding object, a ringing bell, for example. The
ringing bell is a "distal event" that structures an
acoustic signal. The structuring of the air by the
bell is lawful and, to the extent that it also tends
to be specific to its distal source, the structure can
provide information about the source to a 
sensitive perceiver. Like reflected light, the 
acoustic signal (the proximal stimulation) in fact 
has two critical properties that allow it to play a 
central role in perception. It is lawfully structured 
by some distal event and it can stimulate the 
auditory system of perceivers, thereby imparting 
its structure to it. The perceiver then can use the
structure as information for its source. 

As for structure in reflected light, structure in 
an acoustic signal does not resemble the sound- 
producing source in any way. Accordingly, if 
auditory perception works similarly to visual 
perception — that is, if perceivers use structure in 
acoustic signals to recover their distal sources,
then auditory percepts, like visual percepts, will be
heteromorphic. 

Liberman and Mattingly (1985; Mattingly &
Liberman, 1988) suggest, however, that in
general, auditory perceptions are homomorphic. 
We agree that our intuitions are less clear here 
than they are in the case of visual perception. 
However, it is an empirical question whether 
dimensions of listeners' percepts are better 
explained in terms of dimensions of distal events 
or of a distal-event free description of proximal 
stimulation. To date the question is untested, 
however; for whatever reason, researchers who 
study auditory perception rarely study perception 
of natural sound-producing events (see, however,
Repp, 1987; VanDerVeer, 1979; Warren &
Verbrugge, 1984). 

Now consider speech perception. In speech, the
distal event (at least the event in the
environment that structures the acoustic speech
signal) is the moving vocal tract. If, as we
propose, the vocal tract produces phonetic 
gestures, then the distal event is, at the same 
time, the set of phonetic gestures that compose the 
talker's spoken message. The proximal stimulus is
the acoustic signal, lawfully structured by 
movement in the vocal tract. To the extent that 
the structure in the signal also tends to be specific 
to the events that caused it, it can serve as 
information about those events to sensitive 
perceivers. The information that proximal 
stimulation provides will be about the phonetic 
gestures of the vocal tract. Accordingly, if speech 
perception works like visual perception, then 
recovery of phonetic gestures is not eccentric and 
does not require eccentric processing by a speech 
module. It is, instead, yet another instance of 
recovery of distal events by means of lawfully- 
generated structure in proximal stimulation. 

The general point we hope to make is that, 
arguably, all perception is heteromorphic, with 
dimensions of percepts always corresponding to 
those of distal events, not to distal-event free 
descriptions of proximal stimuli. Speech is not 
special in that regard. A more specific point is that 
even if evidence were to show that speech 
perceivers do access their speech-motor systems, 
that perceptual process would not be needed to 
provide the reason why listeners' percepts are 
heteromorphic. The reason percepts are 
heteromorphic is that perceivers universally use 
proximal stimuli as information about events 
taking place in the world; they do not use them as 
perceptual objects per se.

Are phonetic gestures public or private?

Although in "Perception of the speech code" and
"The motor theory revised," evidence that
listeners recover gestures is the only evidence
cited in favor of the view that perceivers access
their speech-motor systems, that evidence is not
the only reason why the motor theorists and other 
theorists invoke a construct inside the perceiver 
rather than the proximal stimulation outside to 
explain why the percept has the character it does. 
A very important reason why, for the motor 
theorists, the proximal stimulation is not by itself 
sufficient to specify phonetic gestures is that, in 
their view, phonetic gestures are abstract control 
structures corresponding to the speaker's
intentions, but not to the movements actually 
taking place in the vocal tract. If phonetic 
gestures aren't "out there" in the vocal tract, then 
they cannot be analogous to other distal events, 
because they cannot, themselves, lawfully 
structure the acoustic signal.

In our view, this characterization of phonetic 
gestures is mistaken, however. We can identify 
two considerations that appear to support it, but 
we find neither convincing. One is that any 
gesture of the vocal tract is merely a token action. 
Yet perceivers do not just recognize the token, 
they recognize it as a member of a larger 
linguistically-significant category. That seems to 
localize the thing perceived in the mind of the 
perceiver, not in the mouth of the talker. More 
than that, the same collections of token gestures 
may be identified as tokens of different categories 
by speakers of different languages. (So, for
example, speakers of English may identify a
voiceless unaspirated stop in stressed-syllable-
initial position as a /b/, whereas speakers of
languages in which voiceless unaspirated stops 
can appear stressed-syllable initially may identify 
it as an instance of a /p/.) Here, it seems, the 
information for category membership cannot 
possibly be in the gestures themselves or in the 
proximal stimulation; it must be in the head of the 
perceiver. The second consideration is that 
coarticulation, by most accounts, prevents 
nondestructive realization of phonetic gestures in 
the vocal tract. We briefly address both 
considerations. 

Yet another analogy: There are chairs in the 
world that do not look very much like prototypical
chairs. Recognizing them as chairs may require 
learning how people typically use them (learning 
their "proper function" in Millikan's terms [1984]). 
By most accounts, learning involves some
enduring change inside the perceiver. Notice,
however, that even if it does, what makes the 
token chair a chair remains its properties and its 
use in the world prototypically as a chair. Too, 
whatever perceivers may learn about that chair 
and about chairs in general is only what they 
learn; the chair itself and the means by which its 
type-hood can be identified remain unquestionably 
out there in the world. Phonetic gestures and 
phonetic segments are like chairs (in this respect). 
Token instances of bilabial closure are members of 
a type because the tokens all are products of a 
common coupling among jaw and lips realized in 
the vocal tract of talkers who achieve bilabial 
closure. Instances of bilabial closure in stressed- 
syllable-initial position that have a particular 
timing relation to a glottal opening gesture are 
tokens of a phonological category, /b/, in some
languages and of a different category, /p/, in 
others because of the different ways that they are 
deployed by members of the different language 
communities. That differential deployment is 
what allowed descriptive linguists to identify 
members of phonemic categories as such, and 
presumably it is also what allows language 
learners to acquire the phonological categories of 
their native language. By most accounts, when 
language learners discover the categories of their 
language, the learning involves enduring changes 
inside the learner. However, even if it does, it is 
no more the case that the phonetic gestures or the 
phonetic segments move inside the mind than it is 
that chairs move inside when we learn how to 
recognize them as such. What we have learned is 
what we know about chairs and phonetic 
segments; it is not the chairs or the phonetic 
segments themselves. They remain outside. 

Turning to coarticulation, it is described in the 
motor theory as "encoding," by Ohala (e.g., 1981) 
as "distortion," by Daniloff and Hammarberg 
(1973) as "assimilation" and by Hockett (1955) as
"smashing" and "rubbing together" of phonetic 
segments (in the way that raw eggs would be 
smashed and rubbed together were they sent 
through a wringer). None of these 
characterizations is warranted, however. 
Coarticulation may instead be characterized as 
gestural layering— a temporally staggered 
realization of gestures that sometimes do and 
sometimes do not share one or more articulators. 

In fact, this kind of gestural layering occurs 
commonly in motor behavior. When someone 
walks, the movement of his or her arm is seen as
pendular. However, the surface movement is a
complex (layered) vector including not only the
swing of the arm, but also movement of the whole
body in the direction of locomotion. This layering
is not described as "encoding," "distortion" or even
as assimilation of the arm movement to the 
movement of the body as a whole. And for good 
reason; that is not what it is. The movement 
reflects a convergence of forces of movement on a 
body segment. The forces are separate for the 
walker, information in proximal stimulation 
allows their parsing (Johansson, 1973), and 
perceivers detect their separation. 

There is evidence already suggesting that at 
least some of coarticulation is gestural layering 
(Carney & Moll, 1971; Ohman, 1966; also see 
Browman & Goldstein, in press a), not encoding or 
distortion or assimilation. There is also convincing 
evidence that perceivers recover separate gestures
more-or-less in the way that Johansson suggests 
they recover separate sources of movement of body 
segments in perception of locomotion. Listeners 
use information for a coarticulating segment that 
is present in the domain of another segment as
information for the coarticulating segment itself 
(e.g., Fowler, 1984; Mann, 1980; Whalen, 1984); 
they do not hear the coarticulated segment as 
assimilated or, apparently, as distorted or encoded
(Fowler, 1981; 1984; Fowler & Smith, 1986).

Our colleagues Catherine Browman and Louis 
Goldstein (1985, 1986, in press a,b) have proposed 
that phonetic primitives of languages are gestural, 
not abstract featural. Our colleague Elliot 
Saltzman (1986; Saltzman & Kelso, 1987; see also, 
Kelso, Saltzman, & Tuller, 1986) is developing a
model that implements phonetic gestures as 
functional couplings among the articulators and 
that realizes the gestural layering characteristic
of coarticulation. To the extent that these 
approaches both succeed, they will show that 
phonetic gestures— speakers' intentions— can be 
realized in the vocal tract nondestructively, and 
hence can structure acoustic signals directly. 

Do listeners need an innate vocal tract 
synthesizer to recognize acoustic reflections of 
phonetic gestures? Although it might seem to 
help, it cannot be necessary, because there is no 
analogous way to explain how observers recognize 
most distal events from their optical reflections. 
Somehow the acoustic and optical reflections of a 
source must identify the source on their own. In 
some instances, we begin to understand the 
means by which acoustic patternings can specify
their gestural sources. We consider one such 
instance next. 



How acoustic structure may serve as
information for gestures. 

We return to the example previously described 
of listeners' perception of those linguistic 
dimensions of an utterance that are cued in some 
way by variation in f0. A variety of linguistic and
paralinguistic properties of an utterance have
converging effects on f0. Yet listeners pull apart
those effects in perception. 

What guides the listeners' factoring of
converging effects of f0? Presumably, it is the
configuration of acoustic products of the several
gestures that have effects, among others, on f0.
Intonational peaks are local changes in an f0
contour that are effected by means that, to a first
approximation, only affect f0; they are produced,
largely, by contraction and relaxation of muscles
that stretch or shorten the vocal folds (e.g., Ohala,
1978). In contrast, declination is a global change
in f0 that, excepting the initial peak in a sentence,
tracks the decline in subglottal pressure (Gelfer et
al., 1985; Gelfer et al., 1987). Subglottal pressure
affects not only f0, but amplitude as well, and
several researchers have noticed that amplitude
declines in parallel with f0 and resets when f0
resets at major syntactic boundaries (e.g.,
Breckenridge, 1977; Maeda, 1976). The parallel
decline in amplitude and f0 constitutes
information that pinpoints the mechanism behind
the f0 decline: gradual lung deflation,
incompletely offset by expiratory-muscle activity.
That mechanism is distinct from the mechanism
by which intonational peaks are produced.
Evidence that listeners pull apart the two effects
on f0 (Pierrehumbert, 1979; Silverman, 1987)
suggests that they are sensitive to the distinct
gestural sources of these effects on f0.

By the same token, f0 perturbations due to
height differences among vowels are not confused
by listeners with information for intonational
peak height even though f0 differences due to
vowel height are local, like intonational peaks,
and are similar in magnitude to differences among
intonational peaks in a sentence (Silverman,
1987). The mechanisms for the two effects on f0
are different, and, apparently, listeners are
sensitive to that. Honda (1981) shows a strong
correlation between activity of the genioglossus
muscle, active in pulling the root of the tongue
forward for high vowels, and intrinsic f0 of vowels.
Posterior fibers of the genioglossus muscle insert
into the hyoid bone of the larynx. Therefore,
contraction of the genioglossus may pull the hyoid
forward, rotating the thyroid cartilage to which
the vocal folds attach, and thereby may stretch
the vocal folds. Other acoustic consequences of
genioglossus contraction, of course, are changes in
the resonances of the vocal tract, which reflect
movement of the tongue. These changes, along
with those in f0 (and perhaps others as well),
pinpoint a phonetic gesture that achieves a vowel-
specific change in vocal-tract shape. If listeners
can use that configuration of acoustic reflections of
tongue-movement (or, more likely, of coordinated
tongue and jaw movement) to recover the vocalic
gesture, then they can pull effects on f0 of the
vocalic gesture from those for the intonation
contour that cooccur with them.

Listeners do just that. In sentence pairs such as
"They only feast before fasting" and "They only
fast before feasting," with intonational peaks on
the "fVst" syllables, listeners require a higher
peak on "feast" in the second sentence than on
"fast" in the first sentence in order to hear the
first peak of each sentence as higher than the
second (Silverman, 1987). Compatibly, among
steady-state vowels on the same f0, more open
vowels sound higher in pitch than more closed
vowels (Stoll, 1984). Intrinsic f0 of vowels does not
contribute to perception of an intonation contour
or to perception of pitch. But it is not thrown away
by perceivers either. Rather, along with spectral
information for vowel height, it serves as
information for vowel height (Reinholt Peterson,
1986).

We will not review the literature on listeners' 
use of f0 perturbations due to obstruent voicing
except to say that it reveals the same picture of
the perceiver as the literature on listeners' use of
information for vowel height (for a description of
the f0 perturbations: Ohde, 1984; for studies of
listeners' use of the perturbations: Abramson &
Lisker, 1985; Fujimura, 1971; Haggard, Ambler, &
Callow, 1970; Silverman, 1986; for evidence that 
listeners can detect the perturbations when they 
are superimposed on intonation contours: 
Silverman, 1986). As the motor theory and the 
theory of direct perception both claim, listeners' 
percepts do not correspond to superficial aspects of 
the acoustic signal. They correspond to gestures, 
signaled, we propose, by configurations of acoustic 
reflections of those gestures. 

Does duplex perception reveal a closed 
speech module? 

We return to the phenomenon of duplex 
perception and consider whether it does 
convincingly reveal distinct closed and open 



modules for speech perception and general 
auditory perception respectively. As noted earlier, 
duplex perception is obtained, typically, when 
most of the acoustic structure of a synthetic 
syllable is presented to one ear, and the 
remainder — usually a formant transition — is 
presented to the other ear (refer to Figure 2). In 
such instances, listeners hear two things. In the 
ear that gets most of the signal, they hear a 
coherent syllable, the identity of which is 
determined by the transition presented to the 
other ear. At the same time, they hear a distinct, 
non-speech 'chirp' in the ear receiving the 
transition. The percept is duplex— the transition 
is heard as a critical part of a speech syllable, 
hypothetically as a result of its being processed by
the speech module, and it is heard simultaneously 
as a non-speech chirp, hypothetically as a result of 
its being processed also by an open auditory 
module (Liberman & Mattingly, 1985). Here we 
offer a different interpretation of the findings. 

Whalen and Liberman (1987) have recently 
shown that duplex perception can occur with 
monaural or diotic presentation of the base and 
transition of a syllable. In this case, duplexity is 
attained by increasing the intensity of the third
formant transition relative to the base until
listeners hear both an integrated syllable (/da/ or
/ga/ depending on the transition) and a non-speech
'whistle' (sinusoids were used for transitions). In
the experiment, subjects first were asked to label
the isolated sinusoidal transitions as "da" or "ga."
Although they were consistent in their labeling,
reliably identifying one whistle as "da" and the
other as "ga," their overall accuracy was not
greater than chance. About half the subjects were
consistently right and the remainder were
consistently wrong. The whistles are distinct, but
they do not sound like "da" or "ga." Next, Whalen
and Liberman determined 'duplexity thresholds' 
for listeners. They presented the base and one of 
the sinusoids simultaneously and gave listeners 
control over the intensity of the sinusoid. 
Listeners adjusted its intensity to the point where 
they just heard a whistle. At threshold, subjects 
were able to match these duplex sinusoids to 
sinusoids presented in isolation. Finally, subjects 
were asked to identify the integrated speech 
syllables as "da" or "ga" at sinusoid intensities
both 6 dB above and 4 dB below the duplexity
threshold. Subjects were consistently good at
these tasks, yielding accuracy scores well above
90%.
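
To make the size of these intensity steps concrete, the following
minimal Python sketch converts a level offset in dB, taken relative
to a listener's duplexity threshold, into an amplitude scale factor.
It is our illustration only, not part of Whalen and Liberman's
procedure. It assumes the usual amplitude convention,
dB = 20 log10(ratio); the function name and the threshold amplitude
of 1.0 are hypothetical.

```python
import math


def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to an amplitude ratio (20*log10 convention)."""
    return 10 ** (db / 20.0)


# Hypothetical sinusoid amplitude at one listener's duplexity threshold
# (arbitrary units; not a value reported in the study).
threshold_amplitude = 1.0

# Offsets mentioned in the text: 6 dB above and 4 dB below threshold.
for offset_db in (6.0, -4.0):
    ratio = db_to_amplitude_ratio(offset_db)
    scaled = threshold_amplitude * ratio
    print(f"{offset_db:+.0f} dB -> amplitude x {ratio:.3f} (scaled: {scaled:.3f})")
```

Under this convention, 6 dB above threshold roughly doubles the
sinusoid's amplitude, and 4 dB below threshold reduces it to roughly
0.63 of its threshold value.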

In the absence of any transition, listeners hear
only the base and identify it as "da" most of the
time. When a sinusoidal transition is present but
at intensities below the duplexity threshold,
subjects hear only the unambiguous syllable ("da"
or "ga" depending on the transition). Finally,
when the intensity of the transition reaches and
exceeds the duplexity threshold, subjects hear the
"da" or "ga" and they hear a whistle at the same
time: i.e., the transition is duplexed.

This experiment reveals two new aspects of the 
duplex phenomenon. One is that getting a duplex 
percept requires a sufficiently high intensity of 
the transition. A second is that the transition 
integrates with the syllable at intensities below 
the duplexity threshold. Based on this latter 
finding, Whalen and Liberman conclude that 
processing of the sinusoid as speech has priority. 
It is as if a (neurally-encoded) acoustic signal 
must first pass through the speech module at 
which point portions of the signal that specify 
speech events are peeled off. After the speech 
module takes its part, any residual is passed on to 
the auditory module where it is perceived 
homomorphically. Mattingly and Liberman (1988) 
refer to this priority of speech processing as
"preemptiveness," and Whalen and Liberman
(1987) suggest that it reflects the "profound
biological significance of speech."

There is another way to look at these findings, 
however. They suggest that duplex perception 
does not, in fact, involve the same acoustic 
fragment being perceived in two ways 
simultaneously. Rather, part of the transition
integrates with the syllable and the remainder is
heard as a whistle or chirp.^ As Whalen and 
Liberman themselves describe it: 

". . . the phonetic mode takes precedence in
processing the transitions, using them for its special
linguistic purposes until, having appropriated its
share, it passes the remainder to be perceived by
the nonspeech system as auditory whistles."
(Whalen & Liberman, 1987, p. 171; our italics).

This is important, because in earlier reports of 
duplex perception, it was the apparent perception
of the transition in two different ways at once that 
was considered strong evidence favoring two 
distinct perceptual systems, one for speech and 
one for general auditory perception. In addition, 
research to date has only looked for preemp- 
tiveness using speech syllables. Accordingly, it is 
premature to conclude that speech especially is
preemptive. Possibly acoustic fragments integrate 
preferentially whenever the integrated signal 
specifies some coherent sound-producing event. 



We have recently looked for duplex perception in 
perception of nonspeech sounds (Fowler and 
Rosenblum, in press). We predicted that it would 
be possible to observe duplex perception and 
preemptiveness whenever two conditions are met: 
1) A pair of acoustic fragments is presented that, 
integrated, specify a natural distal event; and 2) 
one of the fragments is unnaturally intense. 
Under these conditions, the integrated event 
should be preemptive and the intense fragment 
should be duplexed regardless of the type of 
natural sound-producing event that is involved, 
whether it is speech or non-speech, and whether it
is profoundly biologically significant or biologically 
trivial. 

There have been other attempts to get duplex 
perception for nonspeech sounds. All the ones of 
which we are aware have used musical stimuli, 
however (e.g., Collins, 1985; Pastore, Schmuckler, 
Rosenblum, & Szczesuil, 1983). We chose not to 
use musical stimuli because it might be argued 
that there is a music module. (Music is universal
among human cultures, and there is evidence for
an anatomical specialization of the brain for music
perception (e.g., Shapiro, Grossman, & Gardner,
1981).) These considerations led us to choose a
non-speech event that evolution could not have
anticipated. We chose an event involving a recent
human artifact: a slamming metal door. 

To generate our stimuli, we recorded a heavy 
metal door (of a sound-attenuating booth) being 
slammed shut. A spectrogram of this sound can be
seen in Figure 3a. To produce our 'chirp,' we high-
pass filtered the signal above 3000 Hz. To produce
a 'base,' we low-pass filtered the original signal,
also at 3000 Hz (see bottom panels of Figure 3). To
us, the high-passed 'chirp' sounded like a can of
rice being shaken, while the low-pass-filtered base
sounded like a wooden door being slammed shut.
(That is, the clanging of the metal door was
largely absent.)

We asked sixteen listeners to identify the 
original metal door, the base and the chirp. The 
modal identifications of the metal door and the 
base included mention of a door; however, less 
than half the subjects reported hearing a door 
slam. Even so, essentially all of the identifications 
involved hard collisions of some sort (e.g., boots 
clomping on stairs, shovel banged on sidewalk). In 
contrast, no subject identified the chirp as a door 
sound, and no identifications described hard 
collisions. Most identifications of the chirp 
referred to an event involving shaking 
(tambourine, maracas, castanets, keys).

Figure 3. Display of stimuli used to obtain duplex perception of closing-door sounds.

Given our metal-door chirp and a high-pass- 
filtered wooden door slam, subjects could not 
identify which was a filtered metal door slam and 
which a filtered wooden door slam. Subjects were 
consistent in their labeling judgments, identifying
one of the chirps as a metal door and the
other as a wooden door. However, overall, more of
them were consistently wrong than right. On
average, they identified the metal-door chirp as
the sound of a metal door 31% of the time and the
wooden-door chirp as a metal door sound 79% of
the time.

To test for duplex perception and preemptive- 
ness, we first trained subjects to identify the 
unfiltered door sound as a "metal door," the base
as a "wooden door," and the upper frequencies of
the door as a shaking sound. Next we tested them
on stimuli created from the base and the chirp. We
created 15 different diotic stimuli. All included the
base, and almost all included the metal-door chirp. 
The stimuli differed in the intensity of the chirp. 
The chirp was attenuated or amplified by 
multiplying its digitized voltages by the following 
values: 0, .05, .1, .15, .2, .9, .95, 1, 1.05, 1.1, 4, 4.5, 
5, 5.5 and 6. That is, there were 15 different 
intensities falling into three ranges; five were well 
below the natural intensity relationship of the 
chirp to the base, five were in the range of the 
natural intensity relation, and five were well 



1 of closing*door sounds* 

above it. Three tokens of each of these stimuli 
were presented to subjects diotically in a 
randomized order. Listeners were told that they 
might hear one of the stimuli, metal door, wooden 
door or shaking sound, or sometimes two of them
simultaneously, on each trial. They were to
indicate what they heard on each trial by writing
an identifying letter or pair of letters on their 
answer sheets. 
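
The intensity manipulation described above can be illustrated with
a short Python sketch. It is a reconstruction for illustration only,
not the software used in the study. The multipliers are those listed
in the text; the function names, the block labels, and the conversion
to decibels (assuming the amplitude convention 20 log10 of the
multiplier) are our own additions.

```python
import math

# Multipliers applied to the chirp's digitized voltages, as listed in the text.
GAINS = [0, .05, .1, .15, .2, .9, .95, 1, 1.05, 1.1, 4, 4.5, 5, 5.5, 6]


def gain_to_db(gain):
    """Level change in dB for an amplitude multiplier; a gain of 0 removes the chirp."""
    return float("-inf") if gain == 0 else 20.0 * math.log10(gain)


def scale_chirp(samples, gain):
    """Scale a sequence of sample values by a constant multiplier."""
    return [gain * s for s in samples]


# Three blocks of five multipliers: well below, near, and well above the
# natural intensity relation of the chirp to the base (labels are ours).
BLOCKS = {"low": GAINS[0:5], "medium": GAINS[5:10], "high": GAINS[10:15]}

for name, gains in BLOCKS.items():
    levels = ", ".join(f"{gain_to_db(g):+.1f}" for g in gains)
    print(f"{name:>6}: {levels} dB")
```

On this reckoning, the highest multiplier (6) raises the chirp by
about 15.6 dB relative to its natural level, and the lowest nonzero
multiplier (.05) lowers it by about 26 dB.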

In our analyses, we have grouped responses to 
the 15 stimuli into three blocks of five. In Figure 
4, we have labeled these Intensity Conditions low, 
medium, and high. Figure 4 presents the results 
as percentages of responses in the various 
response categories across the three intensity 
conditions. We show only the three most 
interesting (and most frequent) responses. The 
figure shows that the most frequent response for 
the low intensity condition is 'wooden door,' the 
label we asked subjects to use when they heard 
the base. The most frequent response for the 
medium condition is 'metal door,' the label we 
asked subjects to use when they heard the metal
door slam. The preferred response for the high-
intensity block of stimuli is overwhelmingly 'metal
door + chirp,' the response that indicates a duplex
percept. The changes in response frequency over 
the three intensity conditions for each response 
type are highly significant.

[Figure 4 plot; x-axis: Intensity Condition (Low, Middle, High).]

Figure 4. Percentage of responses falling in the three response categories, 'wooden door,' 'metal door,' and 'metal
door plus shaking sound,' across three intensity ranges of the shaking sound.



Our results can be summarized as follows. First, 
at very low intensities of the upper frequencies of 
the door, subjects hear the base only. When the 
'chirp' is amplified to an intensity at or near its 
natural intensity relation to the base, subjects 
report hearing a metal door the majority of the 
time. Further amplification of the 'chirp' leads to
reports of the metal door and a separate shaking 
sound. The percept is duplex, and the metal door 
slam is preemptive. 

There are several additional tests that we must 
run to determine whether our door slams, in fact, 
are perceived analogously to speech syllables in
procedures revealing duplex perception. If we can 
show that they are, then we will conclude that an 
account of our findings that invokes a closed 
module is inappropriate. Evolution is unlikely to 
have anticipated metal door slams, and metal- 
door slams aren't profoundly biologically 
significant. We suggest alternatively that 
preemptiveness occurs when a chirp fills a 'hole'
in a simultaneously presented acoustic signal so
that, together, the two parts of the signal specify
some sound-producing distal event. If anything is



left over after the hole is filled, the remainder is 
heard as separate. 

Summary and concluding remarks 

We have raised three challenges to the motor 
theory. We challenge their inference from 
evidence that phonetic gestures are perceived that 
speech perception involves access to the talker's 
own motor system. The basis for our challenge is a 
claim that dimensions of percepts always conform 
to those of distal events even in cases where 
access to an internal synthesizer for the events is 
unlikely. A second, related, challenge is to the idea 
that only some percepts are heteromorphic— just 
those for which we have evolved closed modules.
When Liberman and Mattingly write that speech 
perception is heteromorphic, they mean 
heteromorphic with respect to structure in 
proximal stimulation, but they always mean as 
well that the percept is homomorphic with respect 
to dimensions of the distal source of the proximal 
stimulation. We argue that percepts are generally 
heteromorphic with respect to structure in 
proximal stimulation, but, whether they are or
not, they are always homomorphic with respect to
dimensions of distal events. Finally, we challenge
the interpretation of duplex perception that
ascribes it to simultaneous processing of one part
of an acoustic signal by two modules. We suggest, 
instead, that duplex perception reflects the 
listener's parsing of acoustic structure into 
disjoint parts that specify, insofar as the acoustic 
structure permits, coherent distal events. 

Where (in our view) does this leave the motor 
theory? It is fundamentally right in its claim that 
listeners perceive phonetic gestures, and also, 
possibly, in its claim that humans have evolved 
neural systems specialized for perception and 
production of phonetic gestures. It is wrong, we
believe, specifically in its claims about what those
specialized systems do, and generally in the view
that closed modules must be invoked to explain 
why distal events are perceived. 

Obviously, we prefer our own, direct-realist, 
theory, not so much because it handles the data 
better, but because, in our view, it fits better in a 
universal theory of perception. But however our 
theory may be judged in relation to the motor 
theory, we recognize that we would not have 
developed it at all in the absence of the important 
discoveries of the motor theorists that gestures 
are perceived.

FOOTNOTES

*In I. G. Mattingly & M. Studdert-Kennedy (Eds.), Modularity
and the motor theory of speech perception. Hillsdale, NJ: Lawrence
Erlbaum Associates, in press.

^Also Dartmouth College, Hanover, New Hampshire.
^Also University of Connecticut, Storrs. Now at the University
of California at Riverside, Department of Psychology.
^There is a small qualification to the claim that listeners cannot
tell what contributions visible and audible information each
have to their perceptual experience in the McGurk effect.
Massaro (1987) has shown that effects of the video display can
be reduced but not eliminated by instructing subjects to look at,
but to ignore, the display.

^Liberman et al. identify a cipher as a system in which each
unique unit of the message maps onto a unique symbol. In
contrast, in a code, the correspondence between message unit
and symbol is not 1:1.

^Liberman et al. propose to replace the more conventional view
of the features of a phoneme (for example, that of Jakobson,
Fant, & Halle, 1951) with one of features as "implicit
instructions to separate and independent parts of the motor
machinery" (p. 44^.

^With one apparent slip on page 2: The objects of speech
perception are the intended phonetic gestures of the speaker,
represented in the brain as invariant motor commands. . .

^One can certainly challenge the idea that listeners recover the
very gestures that occurred to produce a speech signal.
Obviously there are no gestures at all responsible for most
synthetic speech or for "sine-wave speech" (e.g., Remez, Rubin,
Pisoni, & Carrell, 1981) and quite different behaviors underlie a
parrot's or mynah bird's mimicking of speech. The claim that
we argue is incontrovertible is that listeners recover gestures
from speech-like signals, even those generated in some other
way. (We direct realists [Fowler, 1986a,b] would also argue that
"misperceptions" (hearing phonetic gestures where there are
none) can only occur in limited varieties of ways, the most
notable being signals produced by certain mirage-producing
human artifacts, such as speech synthesizers, or mirage-
producing birds. Another, however, possibly, includes signals
produced to mimic those of normal speakers by speakers with
pathologies of the vocal tract that prevent normal realization of
gestures.)

^There are two almost orthogonal perspectives from which
perception can be studied. On the one hand, investigators can
focus on processes inside the perceiver that take place from the
time that a sense organ is stimulated until a percept is achieved
or a response is made to the input. On the other hand, they can
look outside the perceiver and ask what, in the environment,
the organism under study perceives, what information in
stimulation to the sense organs allows perception of the things
perceived, and finally, whether the organisms in fact use the
postulated information. Here we focus on this latter
perspective, most closely associated with the work of James
Gibson (e.g., 1966; 1979; Reed & Jones, 1982).

^It is easy to find examples in which perception is heteromorphic
with respect to the proximal stimulation and homomorphic
with respect to distal events: looming, for example. We can
also think of some examples in which perception appears
homomorphic with respect to proximal stimulation, but in the
examples we have come up with, they are homomorphic with
respect to the distal event as well (perception of a line drawn
by a pencil, for example), and so there is no way to decide
whether perception is of the proximal stimulation or of the
distal event. We challenge the motor theorists to provide an
example in which perception is homomorphic with structure in
proximal stimulation that is not also homomorphic with distal
event structure. These would provide convincing cases of
proximal stimulation perception.

^Bregman (1987) considers duplex perception to disconfirm his
"rule of disjoint allocation" in acoustic scene analysis by
listeners. According to the rule, each acoustic fragment is
assigned in perception to one and only one environmental
source. It seems, however, that duplex perception does not
disconfirm the rule.

^Using a more sensitive, AXB, test, however, we have found that
listeners can match the metal door chirp, rather than a wooden
door chirp, to the metal door slam at performance levels
considerably better than chance.

1. INTRODUCTION

This paper presents the results of our recent 
experimental investigations of a central issue in
linguistic theory: which properties of human 
language are innately determined? There are two 
main sources of information to be tapped to find 
the answer to this question. First, universal 
properties of human languages are plausibly (even 
if not necessarily) taken to be innately 
determined. In addition, properties that emerge in 
children's language in the absence of decisive 
evidence in their linguistic input are reasonably 
held to be innate. Clearly, it would be most 
satisfactory if these two diagnostics for what is 
innate agreed with each other. In some cases they 
do. For example, there is a universal principle 
favoring transformational movement of phrases 
rather than of lexical categories, e.g., topicalization
of noun phrases but not of nouns. To the best of 
our knowledge children abide by this principle; 
they may hear sentences such as Candy, you can't
have now, but they don't infer that nouns can be
topicalized. If they did, they would say things like
*Vegetables, I won't eat the. But this is not an
error characteristic of children. Instead, from the 
moment they produce topicalized constructions at 
all, they apparently produce correct NP- 
topicalized forms such as The vegetables, I won't 
eat. 

In recent years, this happy convergence of 
results from research on universals and research 

This research was supported in part by NSF Grant
BNS 84-18637, and by a Program Project Grant to Haskins
Laboratories from the National Institute of Child Health and
Human Development (HD-01994). The studies reported in this
paper were conducted in collaboration with several friends and
colleagues: Henry Hamburger, Paul Gorrell, Howard Lasnik,
Cecile McKee, Keiko Murasugi, Mineharu Nakayama, Jaya
Sarma and Rosalind Thornton. We thank them for their
permission to gather this work together here.



on acquisition has been challenged by 
experimental studies reporting various syntactic 
failures on the part of children. The children in 
these experiments are apparently violating 
putatively universal phrase structure principles or 
constraints on transformations. Failure to 
demonstrate early knowledge of syntactic 
principles is reported by Jakubowicz (1984), Lust 
(1981), Matthei (1981, 1982), Phinney (1981), 
Roeper (1986), Solan and Roeper (1978), 
Tavakolian (1978, 1981) and Wexler and Chien 
(1985). Some explanation is clearly called for if a 
syntactic principle is respected in all adult 
languages but is not respected in the language of 
children. 

Assuming that the experimental data accurately 
reflect children's linguistic competence, there are 
several possible responses to the unaccom- 
modating data. The most extreme would be to give 
up the innateness claim for the principle in 
question. One might look for further linguistic 
data which show that it isn't universal. Or one 
might abandon the hypothesis that all universal 
principles are innate. For instance, Matthei (1981) 
obtained results that he interpreted as evidence 
that universal constraints on children's 
interpretation of reciprocals are learned, not 
innate. However, this approach is plausible only if 
one can offer some other explanation (e.g., 
functional explanation) for why the constraints 
should be universal. But this is not always easy; 
as Chomsky (1986) has emphasized, many 
properties of natural language are arbitrary and 
have no practical motivation. 

A different response to the apparent failure of
children to respect constraints believed to be 
innate is to argue that the constraints are as yet 
inapplicable to their sentences. The claim is that 
as soon as a child's linguistic analyses have 
reached the level of sophistication at which a
universal constraint becomes relevant, then that
constraint will be respected. For example, Otsu 
(1981) has argued that children who give the 
appearance of violating a universal constraint on 
extraction may not yet have mastered the 
structure to which the constraint applies; they 
may have only some simpler approximation to the 
construction, lacking the crucial property that 
engages the constraint. We will discuss the 
evidence for this below. 

A different approach also accepts the 
recalcitrant data as valid, but rejects the inference 
that the data are inconsistent with the innateness 
hypothesis. It is pointed out that it is possible for 
a linguistic principle to be innately encoded in the 
human brain and yet not accessible to the 
language faculty of children at early stages of 
language acquisition. The principle in question 
might be biologically timed to become effective at 
a certain maturational stage. Like aspects of body 
development (e.g., the secondary sex 
characteristics), linguistic principles might lie 
dormant for many years. One recent proposal 
invoking linguistic maturation, by Borer and 
Wexler (1987), contends that a syntactic principle 
underlying verbal passives undergoes 
maturational development. They maintain that 
before a critical stage of maturation is reached, 
children are unable to produce or comprehend 
passive sentences (full verbal passives with by- 
phrases). 

It may eventually turn out that the innateness 
hypothesis must be augmented by maturation 
assumptions in certain cases. But such 
assumptions introduce new degrees of freedom 
into the theory, so its empirical claims are 
weakened. Unless some motivated predictions can 
be made about exactly when latent knowledge 
will become effective, a maturational approach is 
compatible with a much wider range of data than
the simplest and strongest version of the 
innateness hypothesis, viz. that children have 
access to the same set of universal principles at 
all stages of language development. This more 
restricted position is the one to be adopted until or 
unless there is clear evidence to the contrary, e.g.,
clear evidence of a period or a stage at which all 
children violate a certain constraint, in all 
constructions to which it is applicable, simple as 
well as complex, and in all languages. So far, no 
such case has been demonstrated.^ 

Our research has taken a different approach. 
We argue that the experimental data do not
unequivocally demonstrate a lack of linguistic 



knowledge. We do not deny that children do 
sometimes misinterpret sentences. But the proper 
interpretation of such failures is complicated by 
the existence of a variety of potentially 
confounding factors. Normal sentence 
comprehension involves lexical, syntactic, 
semantic, pragmatic and inferential abilities, and 
the failure of any one of these may be responsible 
for poor performance on an experimental task. It 
is crucial, therefore, to develop empirical methods 
which will distinguish between these various 
factors, so that we can determine exactly where a 
child's deficiencies lie. Until this has been done, 
one cannot infer from children's imperfect 
performance that they are ignorant of the 
grammar of their target language. 

In fact it can be argued in many cases that it is 
non-syntactic demands of the task which are the 
cause of children's errors. We propose that task
performance is weak at first and improves with 
age in large part because of maturation of non- 
linguistic capacities such as short term memory or 
computational ability, which are essential in the 
efficient practical application of linguistic
knowledge. This does not deny that many aspects 
of language must be learned and that there is a 
time when a child has not yet learned them. But 
our interpretation of the data does make it 
plausible that young children know more of the 
adult grammar than has previously been 
demonstrated, and also, most significantly, that 
their early grammars do abide by universal 
principles. 

In support of this non-linguistic maturation 
hypothesis, we have reexamined and 
supplemented a number of earlier experimental 
findings with demonstrations that nonsyntactic 
factors were responsible for many of the children's 
errors. The errors disappear or are greatly 
reduced when these confounding factors are 
suitably controlled for. In this paper, we report on 
a series of experimental studies along these lines 
concerned with three kinds of nonsyntactic factors 
in language performance: parsing, plans, and 
presuppositions. We will argue that these other 
factors are crucially involved in the experimental 
tasks by which children demonstrate their 
knowledge, and that they impose significant 
demands on children. If we underestimate the 
demands of any of these other components of the 
total task we thereby underestimate the extent of 
the child's knowledge of syntax. As a result, the 
current estimate of what children know about 
language is misleading. 
2. CHILDREN'S ERRORS IN
COMPREHENSION

In this section we attempt to identify and isolate 
several components of language-related skills, in 
order to gain a better understanding of each, and 
to clarify the relationship between the innateness 
hypothesis and early linguistic knowledge. Very 
little work has been done on this topic. The 
majority of language development studies seem to 
take it for granted that the experimental 
paradigms provide a direct tap into the child's 
linguistic competence. An important exception is a 
study by Goodluck and Tavakolian (1982), in 
which improved performance on a relative clause 
comprehension task resulted from simplification 
of other aspects of the syntax and semantics of the
stimulus sentences (i.e., the use of intransitive 
rather than transitive relative clauses, and 
relative clauses with one animate and one 
inanimate noun phrase rather than two 
animates). The success of these manipulations is 
exactly in accord with our general hypothesis 
about the relation between competence and 
performance. As other demands on the child's 
performance are reduced, greater competence is 
revealed. 

Our experiments focus on three factors involved 
in many child language experiments which may 
interfere with estimation of the extent of 
children's linguistic knowledge in tasks which are 
designed to measure sentence comprehension. 
These factors (parsing, presupposition and
plans) are of interest in their own right, but have
received very little attention in previous research 
on syntax acquisition. In this section we will 
review our recent work on these topics. In the 
following section we will turn to an alternative 
research strategy for assessing children's 
knowledge, the technique of elicited production. 

Parsing. Sentence parsing is a complex task 
which is known to be governed (in adults) by 
various decision strategies that favor one 
structural analysis over another where both are 
compatible with the input word sequence. Even 
adults make parsing errors, and it would hardly
be surprising, given the limited memory and 
attention spans of children, to discover that they 
do too. These parsing preferences must somehow 
be neutralized or factored out of an experimental 
task whose objective is the assessment of 
children's knowledge of syntactic rules and 
constraints. 

Plans. The formation of an action plan is an 
important aspect of any comprehension task 
involving the manipulation of toys or other
objects. If the plan for manipulating objects 
appropriately in response to a test sentence is 
necessarily complex to formulate or to execute, its 
difficulty for a child subject may mask his correct 
comprehension of the sentence. Thus we need to 
develop a better understanding of the nature and 
relative complexity of such plans, and also to 
devise experimental paradigms in which their
impact on performance is minimized. 

Presuppositions. A variety of pragmatic 
considerations must also be taken into account, 
such as the contextual fixing of deictic reference, 
obedience to cooperative principles of 
conversation, and so forth. In particular, our 
research suggests that test sentences whose 
pragmatic presuppositions are unsatisfied in the 
experimental situation are also unlikely to provide 
results allowing an accurate assessment of a 
child's knowledge of syntactic principles. It is 
necessary to establish which kinds of 
presuppositions children are sensitive to, and to 
ensure that these are satisfied in experimental 
tasks. 

2.1. Parsing 

2.1.1. Subjacency. One universal constraint 
which should be innate is Subjacency. Subjacency 
prohibits extraction of constituents from various 
constructions, including relative clauses. 
However, in an experimental study by Otsu 
(1981), many children responded as if they 
allowed extraction from relative clauses in 
answering questions about the content of pictures. 
For example, children saw a picture of a girl using 
a crayon to draw a monkey who was drinking milk 
through a straw. They were then asked to respond 
to question (1). 

(1) What is Mary drawing a picture of a 
monkey that is drinking milk with? 

Otsu found that many children responded to (1) in 
a way that appeared to violate Subjacency. In this 
case, the answer that is in apparent violation of 
Subjacency is "a straw." This is because "a straw"
is appropriate only if the what has been moved 
from a position in the monkey drinking milk 
clause as shown in (2a), rather than from the 
Mary drawing picture clause as shown in (2b). 

(2) a) *What is Mary drawing a picture of a
monkey [that is drinking milk with __ ]?

b) What is Mary drawing a picture of a
monkey [that is drinking milk] with __ ?

But the monkey drinking milk clause is a relative
clause, and Subjacency prohibits the what from
moving out of it. Thus the only acceptable
structure is (2b), and the only acceptable answer
is "a crayon." If these data are interpreted solely
in terms of children's grammatical knowledge,
the conclusion would have to be that
knowledge of Subjacency sets in quite late in at
least some children.

As we noted earlier, Otsu suggested that the 
innateness of Subjacency could be salvaged by 
showing that the children who appeared to violate 
Subjacency had not yet mastered the phrase 
structure of relative clauses (of sufficient 
complexity to contain an extractable noun phrase). 
When he conducted an independent test of 
knowledge of relative clause structure, he found, 
as predicted, a correlation between phrase 
structure and Subjacency application in the 
children's performance. However, the children's 
performance was still surprisingly poor: 25% of 
the children who were deemed to have mastered 
relative clauses gave responses involving 
ungrammatical Subjacency violating extractions 
from relative clauses. 

We have argued (Crain & Fodor, 1984) for an
alternative analysis of Otsu's data, which makes it 
possible to credit children with knowledge of both 
phrase structure principles and constraints on 
transformations from an early age. We claim that 
children's parsing routines can influence their 
performance on the kind of sentences used in the 
Subjacency test; in particular, that there are 
strong parsing pressures encouraging subjects to 
compute the ungrammatical analysis of such 
sentences. Until a child develops sufficient 
capacity to override these parsing pressures, they 
may mask his syntactic competence, making him 
look as if he were ignorant of Subjacency. 

A powerful general tendency in sentence parsing 
by adults is to attach an incoming phrase low in 
the phrase marker if possible. This has been 
called Right Association; see Kimball (1973),
Frazier and Fodor (1978). In sentence (3), for
example, the preferred analysis has with NP 
modifying drinking milk rather than modifying 
drawing a picture, even though in this case both 
analyses are grammatically well-formed because 
there has been no WH-movement.

(3) Mary is drawing a picture of a monkey that
is drinking milk with NP.

To see how strong this parsing pressure is, note
how difficult it is to get the sensible interpretation
of (3) when a crayon is substituted for NP. This
Right Association preference is still present if the
NP in (3) is extracted, as in (1). The word with in 
(1) still coheres strongly with the relative clause, 
rather than with the main clause. The result is 
that the analysis of (2) that is most immediately 
apparent is the ungrammatical (2a) in which what 
has been extracted from the relative clause. Since 
this 'garden path' analysis is apparent to most 
adults, it is hardly surprising if some of Otsu's 
child subjects were also tempted by it and 
responded to (1) in the picture verification task by 
saying "a straw" rather than "a crayon." 

We conducted several experiments designed to 
establish the plausibility of this claim that the 
relatively poor performance of children on 
sentences like (1) is due to parsing pressures 
rather than to ignorance of universal constraints. 
In the first experiment, we tested children and 
adults on complement-clause questions as in (4). 
Subjacency does not prohibit extraction from 
complement clauses, so if there were no Right 
Association effects this sentence should be fully 
ambiguous, with both interpretations equally 
available. 

(4) What is Bozo watching the dog jump 
through? 

That is, given a picture in which Bozo the clown 
is looking through a keyhole at a dog jumping 
through a hoop, it would be correct to say either 
"the keyhole" or "a hoop." Intuitively, though,
the interpretations are highly skewed for adults, 
with a strong preference for the Right Association 
interpretation ("hoop") in which the preposition
attaches within the lower clause. Our experiment 
showed that the same is true for children. We 
tested twenty 3- to 5-year-olds (mean age 4;6) on 
these sentences using a picture verification task 
just like Otsu's, and 90% of their responses were 
in accord with the Right Association 
interpretation.

Thus children and adults alike are strongly 
swayed by Right Association. This is an important
result. To the best of our knowledge the question 
of whether children's parsing strategies resemble 
those of adults has not previously been
investigated. But children certainly should show
the same preferences as adults, if the human
sentence parsing mechanism is innately
structured. And the parsing mechanism certainly
should be innately structured, because it would be
pointless to be born knowing a lot of facts about
language if one weren't also born knowing how to
use those facts for speaking and understanding. It 
is satisfying, then, to have shown that children 
exhibit Right Association. And the fact that they
do offers a plausible explanation for why so many
of them failed Otsu's Subjacency test: they were
listening to their parsers rather than to their
grammars. 

Our other experiments in support of this
conclusion were designed to show that even people 
whose knowledge of Subjacency is not in doubt — 
i.e., adults — are also tempted to violate 
Subjacency when it is in competition with Right
Association. We ran Otsu's Subjacency test on 
adults just as he did with the children. The adults 
gave Subjacency-violating low attachment 
responses to 21% of these questions. This was not 
quite as high a rate as for the children, but as we
have noted, adults surely have a greater capacity 
than children do for checking an illicit analysis 
and shifting to a less preferred but well-formed 
analysis before they commit themselves to a 
response. In an attempt to equalize adult self- 
monitoring capacities with those of children, we 
re-ran the Subjacency experiment with an 
additional distracting task (= listening for a 
designated phoneme in the stimulus sentence). 
Under these conditions the adults gave 
Subjacency violating responses to 29% of the 
relative clause constructions, a slightly higher 
rate than the 25% for Otsu's child subjects. 

Escalating still further, we changed the 
sentences so that the grammatically well-formed 
analysis was semantically or pragmatically 
anomalous, as in (5). 

(5) What color hat is Barbara drawing a picture 
of an artist with? 

Under these circumstances, where the semantics 
clearly favored the Subjacency-violating analysis,
75% of adults' responses violated Subjacency. This 
makes it very clear that linguistic competence 
may not always be revealed by linguistic 
performance. 

Finally, we ran another study, in which we 
asked adults to classify sentences as ambiguous or 
unambiguous. The sentences were spoken in turn 
with only a few seconds between them, and there 
were 72 of them, so the task was fairly
demanding. The materials included complement 
questions like (4) and relative clause questions 
like (1), as well as ambiguous and unambiguous
control sentences of many varieties. The results
showed a 62% ambiguity detection rate for the
ambiguous control sentences, with a 16% 'false
alarm' rate for the unambiguous control
sentences. Thus the subjects were able to cope 
with the task tolerably well, though not perfectly. 



What was interesting was that the ambiguity of 
the complement questions was detected only 48% 
of the time, in line with our claim that Right 
Association obscures the alternative reading with 
the prepositional phrase in the main clause. And 
most interesting of all was that 80% of the relative 
clause questions were judged to be ambiguous, 
even though Subjacency prohibits one analysis 
and renders them unambiguous. Our explanation 
for this extraordinary result is that the subjects 
first computed the Right Association analysis 
favored by their parsing routines, then recognized
that this was unacceptable because of Subjacency, 
and so rejected it in favor of the analysis with the 
prepositional phrase in the main clause. We 
assume that it was this rapid shift from one 
analysis to the other that gave our subjects such a 
strong impression that these sentences were 
ambiguous. Note that if this misanalysis-with- 
revision occurs 80% of the time for adults, only a 
slight handicap in children's ability to revise 
would be sufficient to account for their errors. 

To sum up: we still have no positive proof that
Subjacency is innate, but at least now there is no 
evidence against it. Our experiments make it 
plausible that children as young as can be tested 
are like adults both with respect to their 
knowledge of this universal constraint and with 
respect to their parsing routines — they are just 
not very good yet at coping with conflicts between
the two. 

2.1.2. Backward pronominalization. A
fundamental constraint on natural language is the 
structure dependence of linguistic rules. The 
innateness hypothesis implies that children's 
earliest grammars should also exhibit structure 
dependence— even if their linguistic experience 
happens to be equally compatible with structure- 
independent hypotheses. However, it has been 
proposed that children initially hypothesize a 
structure-independent constraint on anaphora, 
prohibiting all cases of backward 
pronominalization (Solan, 1983). Backward
pronominalization consists of coreference between 
a noun phrase and a preceding pronoun, as 
indicated by the indices in (6). 

(6) That he_i kissed the lion made the duck_i
happy.

We will argue that children do in fact permit 
backward pronominalization, subject to structure- 
dependent constraints. We contend that the 
appearance of a general restriction against 
backward pronominalization is due to a parsing 
preference for the alternative 'extra-sentential' 
reading of the pronoun in certain comprehension
tasks. The results of a new comprehension
methodology show that children as young as 2;10
admit the same range of interpretations for 
pronouns as adults do. 

Two sources of evidence have been cited to
show that children up to 5 or 6 years uniformly
reject backward pronominalization. First, children 
who are asked to repeat back a sentence such as 
(7) often respond by converting it into a forward 
pronominalization construction, as in (8) (Lust, 
1981). 

(7) Because she was tired, Mommy was sleeping. 

(8) Because Mommy was tired, she was sleeping.

The fact that these children took the trouble to 
exchange the pronoun and its antecedent certainly 
indicates that they disfavor backward pronom- 
inalization in their own productions. But it does 
not show that the backward pronominalization 
interpretation is not compatible with the child's 
grammar, as suggested by Solan (1983). To the 
contrary, the conversion of (7) to (8) shows that 
children do accept backward pronominalization in 
comprehension; for they would think of (8) as an 
acceptable variant of (7) only if they were 
interpreting the pronoun in (7) as coreferential 
with the subsequent lexical noun phrase (Lasnik
& Crain, 1985).

Second, it has been found that when the acting- 
out situation for a sentence like (6) includes a 
potential referent for he other than the duck (e.g., 
a farmer), this unmentioned object is usually 
favored by the children as the referent of the 
pronoun (Solan, 1983; Tavakolian, 1978). In 
contrast to the prevailing view, we would attribute
this to a parsing preference for the extra- 
sentential interpretation of the pronoun; it does 
not have to be taken as evidence that children 
have a grammatical prohibition against backward 
anaphora. Our suggestion, then, is that children's
knowledge might be comparable to that of 
adults, even if their performance differs. 

It is particularly important to keep this 
distinction in mind for potentially ambiguous 
constructions such as these. When a sentence has 
more than one possible interpretation, the 
interpretation that children select can tell us 
which interpretation they prefer; it cannot show
that others are unavailable to them. After all,
adults also exhibit biases in connection with 
ambiguous constructions, but this does not lead us 
to accuse them of ignorance of alternative 
interpretations. To establish how much children
actually do know, we should look for the factors
that might be biassing their interpretations, and 
also for ways of minimizing this bias so that 
interpretations which are less preferred but 
nevertheless acceptable to them have a chance of 
showing through. 

The most likely general source of bias against 
backward pronominalization is the fact that 
interpretation of the pronoun would have to be 
delayed until the antecedent is encountered later 
in the sentence. This retention of uninterpreted 
items may strain a child's limited working 
memory. There is some evidence for this 
speculation. Hamburger and Crain (1984) have
noted that children show a tendency to interpret 
adjectives immediately, without waiting for the 
remainder of the noun phrase, even in cases 
where this leads them to give incorrect responses. 
And Clark (1971) has observed errors attributable
to children's tendency to act out a clause 
immediately without waiting for other clauses in 
the sentence. The only way to interpret the 
pronoun immediately in a sentence like (6) is to 
assign it an extra-sentential referent, as children 
typically do. 

If this proposal is correct, it should be that 
children will accept backward pronominalization 
in an experimental task that presses subjects to 
access every interpretation they can assign to a 
sentence. Crain and McKee (1985) used a
true/false paradigm in which subjects judge the 
truth value of sentences against situations acted 
out by the experimenter. The sentences were as in 
(9), where either a coreferential reading or an 
extra-sentential reading of the pronoun is 
possible. 

(9) When he went into the barn, the fox stole
the food. 

On each trial, a child heard a sentence following 
a staged event acted out by one of two 
experimenters, using toy figures and props. The 
second experimenter manipulated a puppet, 
Kermit the Frog. Following each event, Kermit 
said what he thought had happened on that trial. 
The child's task was to indicate whether or not the
sentence uttered by Kermit accurately described 
what had happened. Children were asked to feed
Kermit a cookie if he said the right thing, that is, 
if what he said was what really happened. In this 
way, 'true' responses were encouraged in the 
experimental situation. But sometimes Kermit 
would say the wrong thing, if he wasn't paying 
close attention. When this happened, the child 
was asked to make Kermit eat a rag. (In pilot 
work without the rag ploy, we had found that
children were reluctant to say that Kermit had 
said something wrong.) 

To test for the availability of both 
interpretations of an ambiguous sentence like (9), 
children judged it twice during the course of the 
experiment, once following a situation in which a 
fox stole some chickens from inside a barn (for the
backward pronominalization interpretation), and
once following a situation in which a man stole
some chickens while a fox was in a barn (for the
extra-sentential interpretation). 

Children accepted the backward anaphora 
reading for all the ambiguous sentences 73% of the
time. The extra-sentential reading was accepted
81% of the time, but the difference was not
significant. Much the same results were obtained 
even for the 7 youngest children, whose ages were 
from 2;10 to 3;4. Only two of the 62 subjects 
consistently rejected the backward anaphora 
reading. Thus most children find the backward 
anaphora reading acceptable, although it might 
not be preferred if they were forced to choose 
between interpretations, as in previous 
comprehension studies. 

We should note that a variety of control 
sentences were also tested to rule out other, less
interesting, explanations of the children's 
performance. For example, the children rejected 
sentence (10) following a situation in which 
Strawberry Shortcake did eat an ice cream, but 
not while she was outside playing. This shows 
that they were not simply ignoring the
subordinated clauses of sentences in deciding 
whether to accept or reject them. 

(10) When she was outside playing,
Strawberry Shortcake ate an ice cream. 

Sentences like (11) were also tested in order to 
establish that subjects were not merely giving 
positive responses to all sentences, regardless of 
their grammatical properties. 

(11) He stole the food when the fox went into 
the barn.

The difference between (11) and the acceptable
backward pronominalization in (9) is that in (11)
the pronoun is in the higher clause and c-
commands the fox, while in (9) the pronoun is in
the subordinate clause and does not c-command
the fox. (A node A in a phrase marker is said to c-
command a node B if there is a route from A to B 
which goes up to the first branching node above A, 
and then down to B. Note that c-command is a 
structure-dependent relation.) There is a 
universal constraint that prohibits a pronoun from
c-commanding its antecedent. And indeed the
children did reject (11) 87% of the time. Note that 
this positive result shows that the children have 
early knowledge not only of the absence of linear 
sequence conditions on pronominalization, but 
also of the existence of structural conditions such 
as c-command. (See also Lust, 1981, and 
Goodluck, 1986.) 
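
The definition of c-command given above can be made concrete with a short sketch. This is only an illustration: the `Node` class, the function names, and the simplified bracketings assigned to sentences (9) and (11) are assumptions introduced here, not structures taken from the studies cited. The sketch also assumes the usual restriction that neither node dominates the other.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """A node in a phrase marker (hypothetical minimal representation)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self):
        # Link each child back to this node so we can walk upward.
        for child in self.children:
            child.parent = self


def dominates(a: Node, b: Node) -> bool:
    """True if a properly dominates b (a lies on the path from b to the root)."""
    node = b.parent
    while node is not None:
        if node is a:
            return True
        node = node.parent
    return False


def c_commands(a: Node, b: Node) -> bool:
    """Go up from a to the first branching node, then check for a path down to b."""
    if a is b or dominates(a, b) or dominates(b, a):
        return False
    node = a.parent
    while node is not None and len(node.children) < 2:
        node = node.parent
    return node is not None and dominates(node, b)


def sentence_9():
    # Assumed bracketing: [S [S' when [S he VP]] [NP the fox] [VP stole the food]]
    he, fox = Node("NP: he"), Node("NP: the fox")
    sub = Node("S'", [Node("when"), Node("S", [he, Node("VP: went into the barn")])])
    Node("S", [sub, fox, Node("VP: stole the food")])
    return he, fox


def sentence_11():
    # Assumed bracketing: [S [NP he] [VP stole the food [S' when [S the fox VP]]]]
    he, fox = Node("NP: he"), Node("NP: the fox")
    sub = Node("S'", [Node("when"), Node("S", [fox, Node("VP: went into the barn")])])
    Node("S", [he, Node("VP", [Node("V': stole the food"), sub])])
    return he, fox


print(c_commands(*sentence_9()))   # pronoun inside the subordinate clause
print(c_commands(*sentence_11()))  # pronoun in the higher clause
```

Under these assumed structures the pronoun c-commands the fox only in (11), which is the configuration the constraint rules out. Linear order plays no role in the computation, which is the sense in which the relation is structure-dependent.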

2.1.3. Subject/Auxiliary Inversion. Another
study (Crain & Nakayama, 1987) also explored 
the tie between children's errors in acquisition 
tasks and sentence processing problems. This 
study was designed to test whether children give 
structure-dependent or structure-independent 
responses when they are required to transform 
sentences by performing Subject/Auxiliary 
inversion. As Chomsky (1971) pointed out, 
transformational rules are universally sensitive to 
the structural configurations in the sentences to
which they apply, not just to the linear sequence
of words. 

The procedure in this study was simply for the 
experimenter to preface declaratives like (12) with 
the carrier phrase "Ask Jabba if...," as in (13). 

(12) The man who is running is bald. 

(13) Ask Jabba if the man who is running is 
bald. 

The child then had to pose the appropriate 
yes/no questions to Jabba the Hutt, a figure from
"Star Wars" who was being manipulated by one of
the experimenters. Following each question,
Jabba was shown a picture and would respond
"yes" or "no."

The sentences all contained a relative clause 
modifying the subject noun phrase. The correct 
structure-dependent transformation moves the 
first verb of the main clause to the front of the 
sentence, past the whole subject noun phrase, as 
in (14). An incorrect, structure-independent 
transformation would be as in (15), where the 
linearly first verb in the word string (which 
happens to be the verb of the relative clause) has 
been fronted. 

(14) Is the man who is running bald? 

(15) *Is the man who running is bald?

For simple sentences with only one clause such 
as (16), which are more frequent in a young child's 
input, both versions of the transformation rule 
give the correct result. 

(16) Ask Jabba if the man is bald. 

(17) Is the man bald? 
It is only on the more complex sentences that the
form of the child's rule is revealed.
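
The contrast between the two candidate rules can be sketched as follows. This is an illustrative toy, not the procedure used in the experiment: the function names are hypothetical, the auxiliary is assumed to be "is", and the structure-dependent version is simply handed the division into subject noun phrase and main verb phrase rather than computing it.

```python
def invert_linear(words):
    """Structure-independent rule: front the linearly first 'is'."""
    i = words.index("is")
    return ["is"] + words[:i] + words[i + 1:]


def invert_structural(subject_np, verb_phrase):
    """Structure-dependent rule: front the auxiliary that begins the main
    verb phrase, moving it past the whole subject noun phrase."""
    aux, rest = verb_phrase[0], verb_phrase[1:]
    return [aux] + subject_np + rest


def as_question(words):
    """Join the words, capitalize the first letter, and add a question mark."""
    text = " ".join(words)
    return text[0].upper() + text[1:] + "?"


# Complex case, as in (12): the two rules give different results.
subject = "the man who is running".split()
predicate = "is bald".split()
print(as_question(invert_structural(subject, predicate)))  # the form in (14)
print(as_question(invert_linear(subject + predicate)))     # the form in (15)

# Simple case, as in (16): the two rules cannot be told apart.
simple_subject = "the man".split()
print(as_question(invert_structural(simple_subject, predicate)))
print(as_question(invert_linear(simple_subject + predicate)))
```

Both functions return the form in (17) for the simple sentence, so single-clause input does not distinguish them. Only a subject containing its own auxiliary separates the two rules.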

The outcome was as predicted by the innateness 
hypothesis: children never produced an incorrect
sentence like (15). Thus, a structure-independent 
strategy was not adopted in spite of its simplicity 
and in spite of the fact that it produces the correct
question forms in many instances. The findings of
this study thus lend further support to the view
that the initial state of the human language 
faculty contains structure-dependence as an 
inherent property. 

The children did make some errors in this 
experiment, and we observed that most of them 
were in sentences with a long subject noun phrase 
and a short main verb phrase, as in (18).

(18) Is the boy who is holding the plate crying?

By contrast, there were significantly fewer 
errors in sentences like (19), which has a shorter 
subject noun phrase and a longer verb phrase. 

(19) Is the boy who is unhappy watching 
Mickey Mouse?

This kind of contrast is familiar in parsing 
studies with adults. In particular, Frazier and 
Fodor (1978) showed that a sequence consisting of 
a long constituent followed by a short constituent 
is especially troublesome for the (adult) parsing 
routines; a short constituent before a long one is
much easier to parse. The distribution of the 
children's errors in the Subject-Auxiliary 
Inversion task may therefore be indicative not of 
inadequate knowledge of the inversion rule, but of 
an adult-like processing sensitivity to interactions 
between structure and constituent length. 

A follow-up study to test this possibility was 
conducted by Nakayama (1987). Nakayama 
systematically varied both the length and the 
syntactic structure of the sentences to be 
transformed by the children. The children made 
significantly fewer Subject-Auxiliary Inversion 
errors in response to embedded questions with 
short relative clauses (containing intransitive 
verbs) as compared to those with long relative 
clauses (containing transitive verbs). With length 
held constant, the children had more difficulty 
with relative clauses that had object gaps, as in 
(20), than with relative clauses that had subject 
gaps, as in (21) (although this effect was not quite
significant). 

(20) The ball the girl kicked is rolling. 

(21) The boy who was slapped is crying. 



The ease of subject gap constructions, as 
compared to object gap constructions, has been 
found in a number of other studies in language 
development, in language impaired populations, 
and in experiments on adult sentence processing 
(where the question of syntactic competence is not 
in doubt). It seems reasonable to interpret these 
results as confirming that children's error rates in 
language tests are highly sensitive to the 
complexity of the sentence parsing that is 
required. 

2.2. Presupposition

Syntactic parsing is not the only factor that has 
been found to mask knowledge of syntactic 
principles. Test sentences whose pragmatic 
presuppositions are unsatisfied in the 
experimental situation have been found to result 
in inaccurate assessments of children's structural 
knowledge. In this section we consider two 
experiments that point to the relevance of 
presuppositional content in sentence 
understanding. 

The structures we discuss here are relative 
clauses and temporal adverbial clauses. A word of 
clarification is needed before we proceed. Up till 
now we have restricted the scope of the innateness
hypothesis to universal constraints (like
Subjacency, and structure-dependence), which 
could not in principle be learned from normal 
linguistic experience (i.e., without extensive 
corrective feedback). But now we want to extend
the innateness hypothesis to a broader class of
linguistic knowledge, knowledge of universal types
of sentence construction. We cannot plausibly 
claim that every aspect of these constructions is 
innate. Rather, every construction will have some 
aspects that are determined by innate principles, 
and other aspects that must be learned. And the 
balance between these two elements varies from 
construction to construction. So it is perfectly 
acceptable on theoretical grounds that some 
constructions should be acquired later than 
others. However, the innateness hypothesis is not
compatible with just any order of acquisition. It
predicts early acquisition of constructions that 
Chomsky calls 'core' language, i.e., the 
constructions that have strong assistance from 
innate principles with just a few parameters to be 
set by learners on the basis of experience. It would 
be surprising to discover that knowledge of these 
constructions was significantly delayed once the 
relevant lexical items had been learned. In the 
absence of a plausible explanation, this would put 
the innateness hypothesis at risk. 

We noted in section 1 a range of possible 
explanations of apparently delayed knowledge of 
linguistic facts. In the present case they would 
include the following: 

• the construction does not, after all, belong to 
the core but is 'peripheral' and hence should 
be acquired late; 

• children don't hear this construction until quite 
late in the course of language development and 
so could not be expected to know it exists; 

• the core principles in question undergo 
maturation and so are not accessible at early 
stages of acquisition; 

• the experimental data are faulty and children 
do indeed have knowledge of this construction. 

We will argue for this last alternative. And just as 
in the previously described studies of innate 
constraints, we will lay the blame for the 
misleading experimental data on the fact that 
traditional experimental paradigms do not make 
sufficient allowance for the limited memory and 
computational capacities of young children. Once 
again, our story is that non-linguistic immaturity 
can create the illusion of linguistic immaturity. 

2.2.1. Relative Clauses. Children typically 
make more errors in understanding sentences 
containing relative clauses (as in 22) than 
sentences containing conjoined clauses (as in 23), 
when comprehension is assessed by a figure 
manipulation (act-out) task. 

(22) The dog pushed the sheep that jumped 
over the fence. 

(23) The dog pushed the sheep and jumped 
over the fence. 

The usual finding that (22) is more difficult for 
children than (23) up to age 6 years or so has been 
interpreted as an instance of late emergence of the 
rules for subordinate syntax in language 
development (e.g., Tavakolian, 1981). However, 
though coordination may be innately favored over 
subordination, it is also true that subordination is 
ubiquitous in natural language; relative clause 
constructions are very close to the 'core.' So 
ignorance of relative clauses until age 6 would 
stretch the innateness hypothesis. 

Fortunately this is not how things stand. 
Hamburger and Crain (1982) showed that the 
source of children's performance errors on this 
task is not a lack of syntactic knowledge. By 
constructing pragmatic contexts in which the 
presuppositions of restrictive relative clauses were 
satisfied, they were able to demonstrate mastery 
of relative clause structure by children as young 
as 3 years. There are two presuppositions in (22): 
(i) that there are at least two sheep in the context, 
and (ii) that one (but only one) of the sheep 
jumped over a fence prior to the utterance. The 
reason why previous studies failed to demonstrate 
early knowledge of relative clause constructions, 
we believe, is that they did not pay scrupulous 
attention to these pragmatic presuppositions. For 
example, subjects were required to act out the 
meaning of a sentence such as (22) in contexts in 
which only one sheep was present. The poor 
performance by young children in these 
experiments was attributed to their ignorance of 
the linguistic properties of relative clause 
constructions. But suppose that a child did know 
the linguistic properties, but that he also was 
aware of the associated presuppositions. Such a 
child might very well be unable to relate his 
correct understanding of the sentence to the 
inappropriate circumstances provided by the 
experiment. Adult subjects may be able to 'see 
through' the unnaturalness of an experimental 
task to the intentions of the experimenter, but it is 
not realistic to expect this of young children. 

Following this line of reasoning, Hamburger and 
Crain (1982) made the apparently minor change of 
adding two more sheep to the acting out situation 
for sentence (22), and obtained a much higher 
percentage of correct responses. The most 
frequent remaining 'error' was failure to act out 
the event described by the relative clause, but 
since felicitous usage presupposes that this event 
has already occurred, this is not really an error 
but is precisely the kind of response that is 
compatible with perfect comprehension of the 
sentence. This interpretation of the data is 
supported by the fact that there was a positive 
correlation between incidence of this response 
type and age. 

We have conducted another series of studies on 
relative clauses, trying several other techniques 
for assessing grammatical competence. In one 
study, we employed a picture verification 
paradigm to see if children could distinguish 
relative clauses from conjoined clauses, despite 
the claim of Tavakolian (1981) that they 
systematically impose a conjoined clause analysis 
on relatives. In this study, seventeen 3- and 4- 
year-olds responded to relative clause 
constructions like (24). 

(24) The cat is holding hands with a man who 
is holding hands with a woman. 

(25) The cat is holding hands with a man and 
is holding hands with a woman. 

This sentence was associated with a pair of 
pictures, one that was appropriate to it and one 
that was appropriate to the superficially similar 
conjoined sentence (25). Seventy percent of the 3- 
year-olds' responses and 94% of the 4-year-olds' 
responses matched sentences with the appropriate 
picture rather than with the one depicting the 
conjoined clause interpretation. 

A second technique we tried used a 'silliness' 
judgment task (see Hsu, 1981) to establish 
whether children can differentiate relative clauses 
from conjoined clauses. Ninety-one percent of the 
responses of the twelve 3- and 4-year-olds tested 
categorized as 'silly' sentences such as (26), 
although sentences such as (27) were accepted as 
sensible 87% of the time. 

(26) The horse ate the hay that jumped over 
the fence. 

(27) The man watched the horse that jumped 
over the fence. 

Notice that sentence (26) would not be anomalous 
if the that-clause were misinterpreted as an and-clause, 
or if it were interpreted as extraposed from 
the subject NP; in both cases, the horse would be 
the understood subject of the relative clause. The 
results therefore indicate that most children 
interpret the that-clause in this sentence 
correctly, i.e., as a subordinate clause modifying 
the hay. Informal testing of adults suggests that 
the only respect in which children and adults 
differ on the interpretation of relative clauses is 
that the adults are somewhat more likely to 
accept the extraposed relative analysis as well, 
though even for adults this analysis is much less 
preferred. 

A third experiment, on the phrase structure of 
relative clause constructions, indicates that 
children, like adults, treat a noun phrase and its 
modifying relative clause as a single constituent, 
inasmuch as they can construe it as the 
antecedent for a pronoun such as one. 

In a picture verification study, fifteen 3- to 5- 
year-olds responded to the instructions in (28). 

(28) The mother frog is looking at an airplane 
that has a woman in it. The baby frog is 
looking at one too. Point to it. 

Ninety-three percent of the time the subjects 
chose the picture in which the baby frog was 
looking at an airplane with a woman in it, in 
preference to the picture in which the baby frog 
was looking at an airplane without a woman in it. 
That is, the relative clause was included in the 
noun phrase assigned as antecedent to the 
pronoun. 



In short: the weight of evidence now indicates 
that children grasp the structure and meaning of 
relative clause constructions quite early in the 
course of language acquisition, as would be 
expected in view of the central position of these 
constructions in natural language. 

2.2.2. Temporal Terms. Another line of 
research has yielded support for the claim that 
presupposition failure is implicated in children's 
poor linguistic performance. These studies 
employed sentences containing temporal clauses 
with before and after, as in (29). 

(29) Push the red car to me before/after you 
push the blue car. 

Clark (1971) and Amidon and Carey (1972) have 
claimed that most normal 3- to 5-year-olds do not 
understand these sentences appropriately. Since 
Amidon and Carey established that the children 
were familiar with concepts of temporal sequence 
(e.g., as expressed by words like first and last), the 
implication is that the structure of these adverbial 
clauses is beyond the scope of the child's grammar 
at this age. 

However, the acting-out tasks employed in these 
studies were once again unnatural ones which 
ignored the presuppositional content of the test 
sentences. Felicitous usage of sentence (29) 
demands that the pushing of the blue car has 
already been contextually established by the 
hearer as an intended, or at least probable, future 
event; but this was not established in the 
experimental tasks. It is very likely, then, that 
these studies underestimated children's ability to 
comprehend temporal subordinate clauses. For 
example, Amidon and Carey reported that five 
and six year old children who were not given any 
feedback frequently failed to act out the action 
described in the subordinate clause. Johnson 
(1975) found that four and five year old children 
correctly acted out commands such as those in (30) 
only 51% of the time; again, the predominant 
error was failure to act out the action described in 
the subordinate clause. 

(30) a. Push the car before you push the 
truck. (S1 before S2) 

b. After you push the motorcycle, push 
the bus. (After S1, S2) 

c. Before you push the airplane, push the 
car. (Before S2, S1) 

d. Push the truck after you push the 
helicopter. (S2 after S1) 

Crain (1982) satisfied the presupposition of the 
subordinate clause by having the subordinate 
clause act correspond to an intended action by the 
subject, and observed a striking increase in 
children's performance. To satisfy the 
presupposition, children were asked, before each 
command, to choose a toy to push on the next 
trial. The child's intention to push a particular toy 
was incorporated into the command that was 
given on that trial. For instance, sentence (30d) 
could be used felicitously for a child who had 
expressed his intent to push the helicopter. 
Correct responses (i.e., responses in which both 
the main clause and subordinate clause action 
were performed, and in the correct order) were 
produced 82% of the time. Crain's interpretation 
of these results was that the children's improved 
performance was due to the satisfaction of the 
presupposition of the subordinate clause. 

However, we now note that the results of that 
study are open to another interpretation. It may 
be that improved performance was not due 
specifically to the contextual appropriateness of 
the sentence, but to the fact that the child's task 
was simplified because he was provided with more 
advance information concerning what his task 
would be. In the act-out or 'do-what-I-say' 
paradigm applied to temporal terms, the child 
must discern two aspects of the command: (i) 
which two toys to move, and (ii) in which order to 
move them. If the child has established his intent 
to move a particular toy, his task involving (i) is 
simplified. Thus, improved performance may be 
due to the satisfaction of presuppositions or it may 
be due to the additional information the child 
possesses. 

Another study was conducted to disentangle 
these two factors (Gorrell, Crain, & Fodor, 1989). 
In this study, there were four groups of subjects. 
One group, the Felicity Group (F), was given 
commands containing before and after with prior 
information about the subordinate clause action, 
just as in the previous experiment. A second 
group, the Information Group (I), received prior 
information about the main clause action; note 
that this does not satisfy the presupposition of the 
sentence. There was also a third group, the No 
Context Group (NC), who received no advance 
information at all, and a fourth group, the Felicity 
plus Information Group (FI), who received 
information over and above what would satisfy the 
felicity conditions since they chose both actions in 
advance. Consider, for example, a subject in the F 
group. He would be asked to choose a toy to push. 
If he chose the bus, for example, a typical 
command would be (31). 

(31) Push the car before you push the bus. 

On the other hand, a subject in the I group who 
had chosen the bus would be given the command 
(32). 

(32) Push the bus before you push the car. 
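
The difference between the F and I manipulations can be made concrete with a small sketch. It is only an illustration of the design as described here, not the experimenters' materials: the function name `make_command` and its parameters are hypothetical, and only the before/after frame of (31) and (32) is modeled.

```python
def make_command(group, chosen, other, connective="before"):
    """Build a command of the form 'Push the X before/after you push the Y.'

    group "F": the toy the child chose goes in the subordinate clause,
               so the presupposition of that clause is satisfied.
    group "I": the chosen toy goes in the main clause, so the child has
               advance information but the presupposition is not satisfied.
    """
    if group == "F":
        main, subordinate = other, chosen
    elif group == "I":
        main, subordinate = chosen, other
    else:
        raise ValueError("this sketch covers only the F and I groups")
    return f"Push the {main} {connective} you push the {subordinate}."


# A child who has chosen the bus:
print(make_command("F", "bus", "car"))  # Push the car before you push the bus.
print(make_command("I", "bus", "car"))  # Push the bus before you push the car.
```

The two calls reproduce (31) and (32): the same advance choice yields different commands depending on which clause the chosen toy is assigned to.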

Fifty-six children participated in the study, 
ranging in age from 3;4 to 5;10 (mean 4;5). Each 
child was assigned to one of the four groups, 
which were of equal size and approximately 
matched for age. The 'game' equipment consisted 
of 6 toy vehicles arranged in a row on a table 
between the child and the experimenter. The 
stimulus set consisted of 12 commands spoken by 
the experimenter which the child was to act out. 
There were three sentences of each of the four 
types illustrated in (30) above. We were careful to 
balance order of choice with order of action and 
assignment to clause type. 

The results showed a significant difference 
between the F and FI groups on one hand, and the 
I and NC groups on the other. Table 1 shows the 
percentages of correct responses, where a correct 
response consisted of performing both actions in 
the sequence specified by the sentence. 

Table 1. 

                                      Main Clause Information 
                                      provided     not provided   mean 
Subordinate Clause    provided        FI  80%      F   74%        77% 
Information           not provided    I   51%      NC  59%        55% 
                      mean                65%          66% 





Note that the relevant factor is whether 
subordinate clause information was provided in 
advance. An analysis of variance confirmed that 
the mere amount of information provided makes 
no significant difference. The FI group performed 
better than the F group by only 6 percentage 
points, which does not approach statistical 
significance. And the I group performed just a 
little worse (non-significantly again) than the NC 
group. 
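
The arithmetic behind this comparison can be checked with a short sketch. It uses only the four cell percentages reported for the FI, F, I and NC groups and takes unweighted means of those cells; this is an assumption of the sketch, not the analysis of variance itself, so the column means come out at 65.5 and 66.5 rather than the reported 65% and 66%, which were presumably computed from the raw responses.

```python
# Cell percentages as reported, keyed by
# (subordinate-clause info given?, main-clause info given?).
cells = {
    (True, True): 80,    # FI
    (True, False): 74,   # F
    (False, True): 51,   # I
    (False, False): 59,  # NC
}


def mean_where(position, value):
    """Unweighted mean of the cells whose key has `value` at `position`."""
    picked = [pct for key, pct in cells.items() if key[position] == value]
    return sum(picked) / len(picked)


# Difference made by providing subordinate-clause information in advance.
sub_effect = mean_where(0, True) - mean_where(0, False)
# Difference made by providing main-clause information in advance.
main_effect = mean_where(1, True) - mean_where(1, False)

print(sub_effect)   # 22.0
print(main_effect)  # -1.0
```

On these figures, advance information about the subordinate clause is worth about 22 percentage points, while advance information about the main clause makes essentially no difference.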

Although our study was not specifically 
designed to assess age differences, we performed a 
post hoc breakdown of correct responses by two 
age groups: under 4;4, and 4;4 and over. The older 
group, as one would expect, performed somewhat 
better than the younger group. What is perhaps 
most interesting is that the younger group appear 
to be even more sensitive than the older group to 
the proper contextual embedding of utterances. 

A breakdown of the types of errors that occurred 
reveals that the predominant errors are (i) acting 
out the main clause only, and (ii) reversing the 
correct order of the actions. As noted above for 
relative clause constructions, acting out the main 
clause only is a quite reasonable response given 
that the context failed to satisfy the subordinate 
clause presupposition. And in fact most of these 
errors were found in the I and NC groups. 
Reversals were the most frequent error type in the 
study, though they constituted only 19% of all 
responses. These errors may reflect a genuine lack 
of comprehension of either the temporal terms or 
the relevant syntactic structure. However, no child 
in either the F or FI groups produced a consistent 
response pattern which would indicate that this 
was the case, so it seems more likely that these 
errors were due primarily just to occasional 
inattention. 

The main conclusion we draw from these results 
is that children, from a very young age, are indeed 
sensitive to the proper contextual embedding of 
language. Their performance is facilitated by 
satisfying the presuppositions of temporal 
subordinate clauses, and information which does 
not satisfy the presuppositions does not result in 
facilitation. 

A secondary conclusion is that children do 
construct the appropriate syntactic structure for 
sentences with embedded clauses. If the children 
in our study had failed to distinguish main from 
subordinate clauses (e.g., by assigning a flat 
conjunction-type structure to the experimenter's 
commands), we would not expect to find the 
difference between the F and I groups we 
observed. Nor is it plausible to suggest that the 
children relied upon a structure-independent 
formula of 'old information precedes new 
information.' For example, for the F group, the 
new information was always in the main clause. If 
children were assuming that old information 
would be first, we would have expected relatively 
poor performance from the F group on sentences 
in which the main clause preceded the 
subordinate clause. In fact, no such effect was 
observed. 

In sum: once again, the linguistic knowledge of 
young children, when freed of interfering 
influences, appears to be quite advanced. Adults 
have the ability to set aside contextual factors in 
an unnatural experimental situation, but 
children, with their more limited cognitive and 
social skills, apparently do not have this ability. 
Consequently, they are highly sensitive to 
pragmatic infelicities. And therefore their 
linguistic knowledge can be accurately appraised 
only by tests which include controls to insure that 
they are not penalized by their knowledge of 
pragmatic principles. 

Another possible source of poor performance by 
children is in formulating the action plans which 
are needed in order to obey an imperative, or act 
out the content of a declarative sentence which 
they have successfully processed and understood. 
As we use the term, a plan is a mental 
representation used to guide action. A plan may 
be simple in structure, consisting of just a list of 
actions to be performed in sequence; or it may be 
internally complex, with loops and branches and 
other such structures now familiar in computer 
programs. 

Formulating a plan is a skill that makes 
demands on memory and computational 
resources. In certain experimental tasks, these 
demands may outweigh those of the purely 
linguistic processing aspects of the task. So when 
children perform poorly, it is important to 
consider the possibility that formulating, storing 
or executing the relevant action plan is the source 
of the problem, rather than imperfect knowledge 
of the linguistic rules or an inability to apply them 
in parsing the sentence at hand. 

2.3.1. Prenominal Modifiers. The first study 
on plans that we conducted was in response to the 
claim by Matthei (1982) and Roeper (1972) that 4- 
to 6-year-olds have difficulty in interpreting 
phrases such as (33) containing both an ordinal 
and a descriptive adjective. 

(33) the second striped ball 

Confronted with an array such as (34), many 
children selected item (ii), i.e., the ball which is 
second in the array and also is striped, rather 
than item (iv) which is the second of the striped 
balls (counting from the left as the children were 
trained to do). 

(34) Array for "the second striped ball" 

[plain] [striped] [plain] [striped] [striped] 

(i) (ii) (iii) (iv) (v) 

The empirical finding, then, appears to be that 
children assign an interpretation that is not the 
same as an adult would assign to expressions of 
this kind. This difference is attributed by Matthei 
to children's failure to adopt the hierarchical 
phrase structure internal to a noun phrase that 
characterizes the adult grammar. This structure 
is shown in (35). Instead, Matthei argues that 
children adopt a 'flat structure' for phrases of this 
kind, with both the ordinal and the descriptive 
adjective modifying the noun directly as in (36). 



(35) 
            NP 
          /    \ 
       DET      N' 
        |     /    \ 
       the  ADJ      N' 
             |     /    \ 
          second ADJ      N 
                  |       | 
               striped   ball 

(36) 
              NP 
          /   |   \ 
       ADJ   ADJ    N 
        |     |     | 
     second striped ball 



Any divergence between children's and adults' 
grammars poses a problem from the standpoint of 
language acquisition theory; namely, explaining 
how the child ultimately converges on the adult 
grammar without correction or other 'negative' 
feedback. Fortunately, there is no need to assume 
an error in the children's grammar in this case, for 
there is an alternative component of the language 
processor in which the errors might have arisen. 
In a series of experiments, Hamburger and Crain 
(1984) show that most children do assign the adult 
phrase structure and do understand the phrase 
correctly as referring to the second of the striped 
balls. The difficulty that children experience 
arises when they attempt to derive from this 
interpretation a procedure for actually identifying 
the relevant item in the array. An analysis of the 
logical structure of the necessary procedure shows 
it to be quite complex, significantly more so than 
the procedure for "count the striped balls," the 
kind of phrase Matthei used in a pretest in an 
attempt to show that children were able to cope 
with the nonsyntactic demands of the task. 
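
As a rough illustration of why the adult reading calls for a more complex procedure, the two readings can be sketched as follows. This is our own minimal sketch, not Hamburger and Crain's analysis: the function names are hypothetical, and the array is an assumed one that is merely consistent with the description of items (ii) and (iv) above.

```python
# Hypothetical array, left to right as items (i)-(v):
# True = striped, False = plain.
ARRAY = [False, True, False, True, True]


def nth_of_striped(array, n):
    """Hierarchical reading: the n-th item among the striped ones.

    Requires a loop with a counter that advances only on striped items.
    """
    seen = 0
    for position, striped in enumerate(array, start=1):
        if striped:
            seen += 1
            if seen == n:
                return position
    return None


def nth_and_striped(array, n):
    """Flat reading: the item in n-th position, provided it is striped.

    A single lookup followed by a single check.
    """
    if n <= len(array) and array[n - 1]:
        return n
    return None


print(nth_of_striped(ARRAY, 2))   # 4, i.e. item (iv)
print(nth_and_striped(ARRAY, 2))  # 2, i.e. item (ii)
```

The first function needs a loop, a conditional and a running count, whereas the second needs only one lookup; the children's typical error corresponds to the output of the simpler procedure.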

This procedural account of the children's errors 
is supported by the sharp improvement in 
performance that results from three changes in 
method. One change is the inclusion of a pretask 
session in which the children handle and count 
homogeneous subsets of the items which are 
subsequently used in an array. This experience is 
assumed to prime some of the procedural planning 
required in the main experimental task. A second 
change in method is to withhold the display while 
the sentence is being uttered, so that formation 
and execution of the plan are less likely to 
interfere with each other. A dramatic 
improvement in performance on a phrase like (33) 
also results from first asking the child to identify 
the first striped ball, which forces him to plan and 
execute part of the procedure he will later need for 
(33). Facilitating the procedural aspects of the 
task thus makes it possible for the child to reveal 
his mastery of the syntax and semantics of such 
expressions. 

Hamburger and Crain also found quite direct 
evidence that children do not assign the 'flat 
structure' analysis. The standard assumption in 
linguistics is that proforms corefer with a 
syntactic constituent. In the correct structure (35), 
the words striped and ball form a complete 
constituent, but in the incorrect structure (36) 
they do not. Thus the children should permit the 
proform one to corefer with striped ball only if 
they have the correct hierarchical structure. To 
find out whether they permit this coreference, 
they were tested on the instructions in (37), with 
the array in (38). 

(37) Point to the first striped ball; point to the 
second one. 



(38) [array of balls] 

(i) (ii) (iii) (iv) (v) 



Hamburger and Crain found that the children 
consistently responded to the second instruction 
by pointing to (v) rather than (iv), showing that 
they took the proform one to corefer with 
expressions like striped ball. Thus it appears that 
they do know the structure (35). 

Finally, we note two experiments by Hamburger 
and Crain (1987). The purpose of these 
experiments was to provide empirical support for 
our claim that response planning is an important 
factor in psycholinguistic tasks, independently of 
syntax and semantics. 

The first experiment attempts to show that 
children's ability to comprehend a phrase is 
inversely related to the complexity of the 
associated plan. For this purpose we compare 
phrases that are arguably equal to each other in 
syntactic complexity but differ in plan complexity. 
Examples are shown in (39), in increasing order of 
complexity of plans. 

(39) i. John's biggest book 
ii. second green book 
iii. second biggest book 

Planning complexity can also be deconfounded 
from the complexity of semantic constituent 
processing. Note that semantic considerations 
lead to the prediction that (39i) would be hardest, 
because its word meanings have to combine in a 
way contrary to the surface sequence of words 
(= the biggest of John's books). 

The pattern of children's responses supports the 
predictions of our procedural account, and does 
not conform to the account based on syntactic or 
semantic constituency. Children responded 
correctly to phrases like (39i) 88% of the time. 
They gave correct answers only 39% of the time 
for examples like (39ii), and they were only 17% 
correct for phrases like (39iii). Thus, this 
experiment provides clear evidence that plans, not 
linguistic structures (syntactic or semantic), can 
determine processing success and failure for 
young children. 

The second experiment addresses the cognitive 
difficulty of planning by prefacing the test with a 
sequence of exercises designed to alleviate the 
planning difficulty. This activity does not provide 
any extra exposure to the phrases tested; 
nevertheless, we anticipate a reduction in errors 
on these phrases. Consider a phrase such as "the 
second tallest building." To identify its referent, 
the interpreter needs a plan: the child must 
integrate sequential pairwise comparisons of 
relative size. In the pre-test activity the child 
would be shown a display of several objects of one 
type (say boxes), but of different sizes, and asked 
to hand the experimenter the biggest one. Then, 
once this object was removed from the array, the 
experimenter asked the child to perform the task 
again, saying, "Now, find the biggest box in this 
group." In this way the child would identify the 
second biggest box without ever hearing the 
phrase "the second biggest box" uttered. 
Children's comprehension of the phrase was 
tested before and after the preparatory task. They 
gave significantly more correct responses (46%) 
following the preparatory task than before it (8% 
in this experiment). This result suggests that 
their difficulty with phrases of this sort stems 
from the complexity of the response plans. 
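
The preparatory task amounts to decomposing the plan into two passes of the same simple step. The following is a minimal sketch of that decomposition under our own assumptions: the function names and the box sizes are hypothetical, sizes are assumed to be distinct, and the array is assumed to hold at least two objects.

```python
def biggest(sizes):
    """Return the index of the largest item, found by pairwise comparisons."""
    best = 0
    for i in range(1, len(sizes)):
        if sizes[i] > sizes[best]:
            best = i
    return best


def second_biggest(sizes):
    """Find the biggest, set it aside, then find the biggest of the rest."""
    first = biggest(sizes)
    rest = [i for i in range(len(sizes)) if i != first]
    winner_in_rest = biggest([sizes[i] for i in rest])
    return rest[winner_in_rest]


boxes = [3, 9, 5, 7]  # hypothetical box sizes
print(biggest(boxes))         # 1 (the box of size 9)
print(second_biggest(boxes))  # 3 (the box of size 7)
```

In the pre-test activity the experimenter in effect carries out the outer step (removing the first winner), so the child only has to run the inner comparison twice.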

3. SENTENCE PRODUCTION 
To acquire a language is to learn a mapping 
between potential utterances and associated 
potential meanings. Successful mastery should 
reveal itself in both comprehension and 
production. In the previous section we were 
concerned with studies of children's 
comprehension, in which their knowledge is tested 
by presenting utterances and observing the 
interpretations that they assign. We now turn to 
tests of children's competence which proceed in 
the other direction: the input to the child is a 
situation, which has been designed to suggest a 
unique sentence meaning, and the behavior we 
observe is the utterance by which the child 
describes that situation. 

It would have been reasonable to expect that the 
sorts of nonsyntactic problems that present 
obstacles for children in comprehension tasks 
might prove to be as hard or even harder for them 
to overcome in production tasks. But we have not 
found this to be the case. The results of recent 
elicited production studies are dramatically better 
than those of comprehension studies directed to 
the same linguistic constructions. For example, 
Richards (1976) elicited appropriate uses of the 
deictic verbs come and go from children age 4;0 - 
7;7, while Clark and Garnica (1974) reported that 
even 8-year-olds didn't consistently distinguish 
between come and go in a comprehension task. 

The disparity between production and 
comprehension studies is particularly striking 
because it is the reverse of what one would expect. 
To find production superior to comprehension in 
children's language is as surprising as it would be 
to find production superior to comprehension in 
adult second-language learning, or to find recall 
superior to recognition in any psychological 
domain. It is plausible to argue, therefore, that 
the superiority of production is only apparent, and 
is due to differences in the sensitivities of 
production tests and comprehension tests. And the 
logic of the situation suggests that it is 
comprehension tests that are deficient. After all, 
success is hard to argue with. With suitable 
controls, successful production by children is a 
strong indicator of underlying linguistic 
competence, as long as their productions are as 
appropriate and closely attuned to the context as 
adult utterances are. Because there are so many 
ways to combine words incorrectly, consistently 
correct combinations in the appropriate contexts 
are not likely to come about by accident. On the 
other hand, failure on any kind of psychological 
task cannot be secure evidence of lack of the 
relevant knowledge, since the knowledge may be 
present but imperfectly exploited. 

As we saw in the previous section, 
comprehension studies seem to be particularly 
susceptible to problems of parsing, planning and 
so forth which impede the full exploitation of 
linguistic knowledge. Production tasks appear to 
be less hampered by these extra-grammatical 
factors. This is probably because production 
avoids non-verbal response planning, which we 
have seen is a major source of difficulty in act-out 
comprehension tasks. It is worth noting also that 
in constructing contexts to elicit particular 
utterance types, we have no choice but to attend to 
the satisfaction of the presuppositions that are 
associated with the syntactic structures in 
question, because otherwise the subjects won't 
utter anything like the construction that is being 
targeted. In elicited production it is delicate 
manipulations of the communicative situation 
that give one control over the subject's utterances. 

3.1. Relative Clauses. In section 2 we 
presented evidence of young children's competence 
with relative clauses. Further confirmation was 
obtained by Hamburger and Crain (1982), using 
an elicited production methodology. Pragmatic 
contexts were constructed in which the 
presuppositions of restrictive relatives were 
satisfied. It was discovered that children as young 
as three reliably produce relative clauses in these 
contexts. 

A context that is uniquely felicitous for a 
relative clause is one which requires the speaker 
to identify to an observer which of two objects to 
perform some action on. In our experiment, the 
observer is blindfolded during identification of a 
toy, so the child cannot identify it to the observer 
merely by pointing to it or saying this/that one. 
Also, the differentiating property of the relevant 
toy is not one that can be encoded merely with a 
noun (e.g., the guard) or a prenominal adjective 
(e.g., the big guard) or a prepositional phrase (e.g., 
the guard with the gun), but involves a more 
complex state or action (e.g., the guard that is 
shooting Darth Vader). Young children reliably 
produce meaningful utterances with relative 
clauses when these felicity conditions are met. For 
example: 

(40) Jabba, please come over to point to the one 
that's asleep. (3;5) 

Point to the one that's standing up. (3;9) 

Point to the guy who's going to get killed. 
(3;9) 

Point to the kangaroo that's eating the 
strawberry ice cream. (3;11) 

Note that the possibility of imitation is excluded 
because the experimenter takes care not to use 
any relative clause constructions in the elicitation 
situation. This technique has now been extended 
to younger children (as young as ?;8), and to the 
elicitation of a wider array of relative clause 
constructions, including relatives with object gaps 
(e.g., the guard that Princess Leia is standing on). 

3.2. Passives. Borer and Wexler (1987) have 
argued that A-chains, which are involved in the 
derivation of verbal passive constructions, are not 
available to children in the first few years. Borer 
and Wexler maintain that knowledge of A-chains 
is innate, but becomes accessible only after the 
language faculty undergoes maturational change. 
We were not convinced, however, that this 
maturation hypothesis is necessitated by the facts. 
Rather, the facts seem to be consistent with A- 
chains being innate and accessible from the 
outset. 

One source of data cited in support of the
maturation hypothesis is the absence of full
passives in the spontaneous speech of young
children. But this of course is not incontrovertible
evidence that children's grammars are incapable
of generating passives. Full passives are rarely
observed in adults' spontaneous speech either, or
in adult speech to children. But their paucity is
not interpreted in this case as revealing a lack of
grammatical knowledge. Instead, it is understood
as due to the fact that the passive is a marked
form which it is appropriate to use only in certain
discourse contexts; in most contexts the active is
acceptable and more natural, or a reduced passive
without a by-phrase is sufficient. That is, the
absence of full verbal passives in adult speech is
assumed to be a consequence of the fact that it's
only in rare situations that the full passive is
uniquely felicitous. But the same logic that
explains why adults produce so few full passives
may apply equally to children. Perhaps they too
have knowledge of this construction, but do not
use it except where the communicative situation is
appropriate.

We have tested this possibility in an experiment
with thirty-two 3- and 4-year-old children (Crain,
Thornton, & Murasugi, 1987). One experimenter
asked the child to pose questions to another
experimenter. The pragmatic context was
carefully controlled so that questions containing a
full verbal passive would be fully appropriate. The
following protocol illustrates the elicitation
technique:

Adult: See, the Incredible Hulk is hitting one of the
soldiers. Look over here. Darth Vader goes over
and hits a soldier. So Darth Vader is also hitting
one of the soldiers. Ask Keiko which one.

Child to Keiko: Which soldier is getting hit by Darth
Vader?

Note that the child knows what the correct answer
is to his question, and that he cannot expect to
elicit this answer from his interlocutor (Keiko)
unless he includes the by-phrase. In fact, exactly
50% of responses were passives with full
by-phrases. Of course, active constructions are also
felicitous in this context (e.g., Which soldier is
Darth Vader hitting?), even though the contextual
contrast with another agent (the Incredible Hulk)
may tend to favor the passive stylistically. And
indeed 31% of responses were active questions
with object gaps. The other 19% of responses
included mostly sentences that were grammatical
but not as specific as the context demanded (e.g.,
passives lacking by-phrases).

Using this technique, we were able to elicit full 
verbal passives from all but three of the thirty-two 
children tested so far, including ones as young as 
3;4. Some examples are shown in (41). 

(41) She got knocked down by the Smurfie. (3;4)
     Which girl is pushing, getting pushed by a
     car? (3;8)
     He got picked up from her. (3;11)
     It's getting ate up from Luke Skywalker. (4;0)
     Which giraffe gets huggen by Grover? (4;9)

Note that these utterances contain a variety of
morphological and other errors, but they all
nevertheless exhibit the essential passive
structure (underlying subject in pre-verbal
position; agent in post-verbal prepositional
phrase).9 It might be argued that the children's
passives elicited in this experiment do not involve
true A-chains. However, since they are just like
adult passives (disregarding morphological
errors), the burden of proof falls on anyone who
holds that adult passives involve A-chains and
children's passives do not. No criterion has been
proposed, as far as we know, which distinguishes
adults' and children's passives in this respect. For
example, it is true that the children almost always
use a form of get in place of the passive auxiliary
be, but get is acceptable in adult passives also.
(Get is more regular and phonologically more
prominent than forms of be, and this may be why
it is more salient for children.)

Children's considerable success in producing
passive sentences appropriate to the
circumstances (i.e., their correct pairing of
sentence forms and meanings) constitutes
compelling evidence of their grammatical
competence with this construction. Comparison of
these results with the results of testing the same
children with two comprehension paradigms
(act-out and picture-verification) confirms that, like
spontaneous production data, these measures
underestimate children's linguistic knowledge.

The finding that young children evince mastery
of the passive obviates the need to appeal to
maturation to account for its absence in early
child language. Maturation cannot of course be
absolutely excluded; but a maturation account is
motivated only where a construction is acquired
surprisingly late, where this means later than
would be expected on the basis of processing
complexity, pragmatic usefulness in children's
discourse, and so forth. (Also, as noted in section
1, some important cross-language and
cross-construction correlations need to be established to
confirm a maturational approach; see Borer &
Wexler, 1987, on comparison of English passives
with passive and causative constructions in
Hebrew.) The elicited production results suggest
that the age at which passive is acquired in
English falls well within a time span that is
compatible with these other factors, and so
maturation does not need to be invoked.

3.3. Wanna contraction. Another phenomenon
that can be shown by elicitation to appear quite 
early in acquisition is wanna contraction in 
English. The facts are shown in (42) and (43). 

(42) a. Who do you want to help? 
b. Who do you wanna help? 

(43) a. Who do you want to help you? 
b. Who do you wanna help you? 

Every adult is (implicitly) aware that 
contraction is admissible in (42b) but not (43b). 
However, on the usual assumption that children 
do not have access to 'negative data' (i.e., are not 
informed of which sentences are ungrammatical) 
it is difficult to see how this knowledge about the 
ungrammaticality of sentences like (43b) could be 
acquired from experience (at any age). So this is 
yet another candidate for innate linguistic 
knowledge. (What is known innately would be 
that a trace between two words prevents them 
from contracting together. The relevant difference 
between (42b) and (43b) is that in (43b) the who is 
the subject of the subordinate clause and has been 
moved from a position between the want and the 
to. The trace of this noun phrase that is left 
behind blocks the contraction. In (42b), by
contrast, the trace is in object position after help, 
and therefore is not in the way of the contraction.) 

Crain and Thornton (in press) used the elicited
production technique to encourage children to ask
questions that would reveal violations like (43b) if
these were compatible with their grammars. The
target productions were evoked by having children
pose questions to a rat who was too timid to talk
to grown-ups. The details of the procedure are
illustrated in the following scenarios:

Protocol for Object Extraction 

Experimenter: The rat looks hungry. I bet he wants to
eat something. Ask him what.

Child: What do you wanna eat?

Protocol for Subject Extraction

Experimenter: One of these guys gets to take a walk,
one gets to take a nap, and one gets to eat a
cookie. So one gets to eat a cookie, right? Ask
Ratty who he wants.

Child: Who do you want to eat the cookie?

Using this technique, questions involving both
subject and object extraction were elicited from 21
children, who ranged in age from 2;10 to 5;5, with
an average age of 4;3. The preliminary findings of
the experiment are clearly in accord with the
expectations of the innateness hypothesis,
although we must verify our own subjective
assessment of these data using a panel of
judges.10 In producing object extraction questions
(which permit contraction in the adult grammar),
children gave contracted forms 59% of the time
and uncontracted forms 18% of the time. (There
were 23% of other responses not of the target
form, such as What can you eat to see in the dark?)
By contrast, children's production of subject
extraction questions (where contraction is illicit)
contained contracted forms only 4% of the time and
uncontracted forms 67% of the time (with 29% of
other responses).

The systematic control of this subtle contrast
could perhaps have been shown on the basis of
spontaneous production data, but the crucial
situations (particularly those that call for subject
extraction questions) probably occur quite rarely
in children's experience, just as they do in the case
of the full passive. So it is not easy to gather data
in sufficient quantity for statistical analysis. By
contrast, the elicitation technique is obviously an 
efficient way of generating data, and thus 
facilitates testing for early acquisition of a variety 
of constructions relevant to the innateness 
hypothesis. 

4. CONCLUSION 
In this paper we have reviewed a great many
empirical studies. The thread that ties them
together is the idea that, when performance
problems are minimized in testing situations,
children show early knowledge of a wide range of
basic constructions. As early as 1965, Bellugi
suggested that children's errors on Wh-questions
were due, not to a lack of knowledge of the two
relevant transformations (Wh-movement and
Subject/Auxiliary Inversion), but to a not yet fully
developed capacity to apply both rules in the same
sentence derivation. Our work extends this
general idea to a broader set of linguistic
phenomena. Our particular emphasis has been
constructions which linguistic theory predicts
should require little or no learning because they
involve principles which are universal and hence
innate. Our findings suggest that the innateness
hypothesis for language is still secure even in its
simplest form (in which different innate principles
are not timed to mature at different
developmental stages). Maturation of
nonlinguistic abilities appears to be sufficient to
account for the time course of linguistic
development.

REFERENCES 

Amidon, A., & Carey, P. (1972). Why five-year-olds cannot understand before and after. Journal of Verbal Learning and Verbal Behavior, 11, 417-423.
Bellugi, U. (1965). The development of interrogative structures in children's speech. In K. Riegel (Ed.), The development of language functions. University of Michigan Language Development Program, Ann Arbor, Report No. 8.
Borer, H., & Wexler, K. (1987). The maturation of syntax. In T. Roeper & E. Williams (Eds.), Parameter setting. Dordrecht, Holland: D. Reidel Publishing Company.
Chomsky, N. (1971). Problems of knowledge and freedom. New York: Pantheon Books.
Chomsky, N. (1986). Knowledge of language: Its nature, origin, and use. New York: Praeger.
Clark, E. V. (1971). On the acquisition of the meaning of before and after. Journal of Verbal Learning and Verbal Behavior, 10, 266-275.
Clark, E. V., & Garnica, O. K. (1974). Is he coming or going? On the acquisition of deictic verbs. Journal of Verbal Learning and Verbal Behavior, 13, 559-572.
Crain, S. (1982). Temporal terms: Mastery by age five. Papers and Reports on Child Language Development, 21, 33-38.
Crain, S., & Fodor, J. D. (1984). On the innateness of Subjacency. Proceedings of the Eastern States Conference on Linguistics (Vol. 1). Columbus: Ohio State University.
Crain, S., & McKee, C. (1985). Acquisition of structural restrictions on anaphora. Proceedings of the North Eastern Linguistic Society, 16. Amherst: University of Massachusetts.
Crain, S., & Nakayama, M. (1987). Structure-dependence in grammar formation. Language, 63, 522-543.
Crain, S., Thornton, R., & Murasugi, K. (1987). Capturing the evasive passive. Paper presented at the 12th Annual Boston University Conference on Language Development.
Crain, S., & Thornton, R. (in press). Recharting the course of language acquisition: Studies in elicited production. In N. Krasnegor, D. Rumbaugh, R. Schiefelbusch, & M. Studdert-Kennedy (Eds.), Biobehavioral foundations of language development. Hillsdale, NJ: Lawrence Erlbaum Associates.






Frazier, L., & Fodor, J. D. (1978). The sausage machine: A new two-stage parsing model. Cognition, 6, 291-325.
Goodluck, H. (1986). Children's interpretation of pronouns and null NPs: Structure and strategy. In P. Fletcher & M. Garman (Eds.), Language acquisition: Studies in first language development (2nd ed.). Cambridge, MA: Cambridge University Press.
Goodluck, H., & Tavakolian, S. (1982). Competence and processing in children's grammar of relative clauses. Cognition, 389-416.
Gorrell, P., Crain, S., & Fodor, J. D. (1989). Contextual information and temporal terms. Journal of Child Language, 16, 623-632.
Hamburger, H., & Crain, S. (1982). Relative acquisition. In S. Kuczaj (Ed.), Language development (Volume 1, pp. 245-274). Hillsdale, NJ: Lawrence Erlbaum Associates.
Hamburger, H., & Crain, S. (1984). Acquisition of cognitive compiling. Cognition, 17, 85-136.
Hamburger, H., & Crain, S. (1987). Plans and semantics in human processing of language. Cognitive Science, 11, 101-136.
Hsu, J. R. (1981). The development of structural principles related to complement subject interpretation. Doctoral dissertation, The City University of New York.
Jakubowicz, C. (1984). On markedness and binding principles. Proceedings of the Northeastern Linguistic Society, Amherst, Massachusetts.
Johnson, H. (1975). The meaning of before and after for preschool children. Journal of Experimental Child Psychology, 19.
Kimball, J. (1973). Seven principles of surface structure parsing in natural language. Cognition, 2, 15-47.
Lasnik, H., & Crain, S. (1985). On the acquisition of pronominal reference. Lingua, 65, 135-154.
Lust, B. (1981). Constraint on anaphora in child language: A prediction for a universal. In S. Tavakolian (Ed.), Language acquisition and linguistic theory (pp. 74-96). Cambridge, MA: MIT Press.
Matthei, E. M. (1981). Children's interpretations of sentences containing reciprocals. In S. Tavakolian (Ed.), Language acquisition and linguistic theory (pp. 97-115). Cambridge, MA: MIT Press.
Matthei, E. M. (1982). The acquisition of prenominal modifier sequences. Cognition, 11, 301-332.
Nakayama, M. (1987). Performance factors in subject-aux inversion by children. Journal of Child Language, 14, 113-125.
Otsu, Y. (1981). Universal grammar and syntactic development in children: Toward a theory of syntactic development. Unpublished doctoral dissertation, M.I.T.
Phinney, M. (1981). The acquisition of embedded sentences and the NIC. Proceedings of the North Eastern Linguistic Society, 11. Amherst: University of Massachusetts.
Richards, M. M. (1976). Come and go reconsidered: Children's use of deictic verbs in contrived situations. Journal of Verbal Learning and Verbal Behavior, 15, 655-665.
Roeper, T. W. (1972). Approaches to a theory of language acquisition with examples from German children. Unpublished doctoral dissertation, Harvard University.
Roeper, T. W. (1986). How children acquire bound variables. In B. Lust (Ed.), Studies in the acquisition of anaphora, Volume I. Dordrecht, Holland: D. Reidel Publishing Company.
Solan, L. (1983). Pronominal reference: Child language and the theory of grammar. Dordrecht, Holland: D. Reidel Publishing Company.
Solan, L., & Roeper, T. W. (1978). Children's use of syntactic structure in interpreting relative clauses. In H. Goodluck & L. Solan (Eds.), Papers in the Structure and Development of Child Language, UMASS Occasional Papers in Linguistics (Vol. 4, pp. 105-126).
Tavakolian, S. L. (1978). Children's comprehension of pronominal subjects and missing subjects in complicated sentences. In H. Goodluck & L. Solan (Eds.), Papers in the Structure and Development of Child Language, UMASS Occasional Papers in Linguistics (Vol. 4, pp. 145-152).
Tavakolian, S. L. (1981). The conjoined-clause analysis of relative clauses. In S. Tavakolian (Ed.), Language acquisition and linguistic theory (pp. 167-187). Cambridge, MA: MIT Press.
de Villiers, J. G., & de Villiers, P. A. (1986). The acquisition of English. In D. Slobin (Ed.), The crosslinguistic study of language acquisition, Volume I: The data. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wexler, K., & Chien, Y. (1985). The development of lexical anaphors and pronouns. Papers and Reports on Child Language Development. Stanford, CA: Stanford University.

FOOTNOTES 

*Language and cognition: A developmental perspective (in press).
Norwood, NJ: Ablex.

†Also University of Connecticut, Storrs.

‡Also Graduate Center, City University of New York.

1 Susan Carey has pointed out (Boston University Conference,
1988) that the linguistic maturation hypothesis predicts that
knowledge of a linguistic principle should correlate with
gestation age rather than with birth age in children born
prematurely. Unfortunately, variability is probably such that
no clear correlation could be expected to show itself by age 4 or
5, when passives and other relevant syntactic constructions are
claimed to emerge.

2 For full details of procedure and results of this experiment and
of all other studies reported in this paper, we refer readers to
the original publications.

3 As far as is known at present, no natural language exhibits this
blanket prohibition against backward pronominalization; see
discussion in Lasnik and Crain (1985). This suggests that it is
not a possible constraint in a natural language grammar, in
which case it should not be entertained by children at any age
or stage of acquisition (unless one assumes linguistic
maturation).

4 The awkwardness of the prosodic contour for (18), with its
heavy juncture before the final word, may indicate that this
kind of construction is also an unnatural one for the sentence
production routines.

5 In reviewing the literature on relative clauses, de Villiers and
de Villiers (1986) suggest that if earlier work had counted the
assertion-only response as correct, children would have been
seen to perform better there too. This objection is unwarranted,
for two reasons. First, responses of this type did not appear in
other studies, presumably because these studies failed to meet
the presuppositions of the restrictive relative clause. More
important, in the Hamburger and Crain study this response
was not evinced by any of the 3-year-old children, and
accounted for only 13% of the responses of the 4-year-olds.
Nevertheless, even the 3-year-olds acted out sentences with
relative clauses at a much higher rate (69%) of success than in
earlier studies.

6 These results should be interpreted with caution due to the
small and unequal number of subjects in each subgroup
leading to rather uneven data. For example, the older F group
performed relatively poorly compared to the younger F group,
though closer analysis reveals that this is due to the poor
performance of just one child (M) in the F group.

7 There were no main-clause-only errors in the FI group. For the
F group, 12 of the 16 main-clause-only errors (out of 168
responses) are due to one child.

8 An A-chain is the association of a trace with a moved noun
phrase in an A-position (an argument position such as Subject).
For example, in The bagel was eaten by Bill there is an A-chain
consisting of the bagel and its associated trace after eaten.

9 The proper reversal of underlying subject and object order
occurred even when the task was complicated by an
implausible scene to be described. For example, the sentence
One dinosaur's being eated from the ice cream cone was used to
describe a situation in which the dinosaur was indeed being
eaten by the ice cream, not vice versa.

10 In preliminary evaluation of the audio tapes, we have found it
unexpectedly easy to distinguish children's contracted and
non-contracted forms in most cases.

A labeling test with synthetic speech stimuli was carried out to determine to what extent
the two dimensions of fundamental frequency (Fo), height and movement, and syllable
duration provide cues to tonal distinctions in Taiwanese. The data show that the high level
vs. mid level tones and the high falling vs. mid falling tones can be reliably distinguished
by Fo height alone, whereas the distinction between tones with dissimilar contours, such
as the high falling and low rising tones, is predominantly cued by Fo movement. However,
the other dimension of Fo may collaborate with the dominant one in cueing a tonal
contrast, depending on the extent to which the two tones differ along that dimension.
Duration has a small additional effect on the perception of the distinction between
falling and nonfalling tones. These results are consistent with previous findings in tone
languages other than Taiwanese in that they suggest that tones are mainly cued by Fo.
While the relative importance of the two Fo dimensions as cues to tonal contrasts depends
on the contrast to be distinguished, the present findings show that tones which nominally
differ only in register (e.g., high falling vs. mid falling) may exhibit perceptually relevant
contour differences, and vice versa.



This research, which formed part of the first author's
doctoral dissertation, was supported by NICHD Grant
HD-01994 to Haskins Laboratories. Special thanks are due to
Arthur Abramson, Carol Fowler, and Ignatius Mattingly for
their helpful comments on an earlier version of this paper, and
to Jackson Gandour and Eva Gårding for serving as reviewers
for Language and Speech.

INTRODUCTION

The primary acoustic attribute distinguishing
linguistic tones is their fundamental frequency
(Fo), although duration and amplitude of the
syllable carrying the tone may also exhibit
characteristic differences. This observation raises
the question of whether, in perceiving a phonemic
tone, listeners integrate all these acoustic cues, or
whether they pay attention to Fo alone.

Several studies have investigated the role of Fo
versus other properties as cues to tonal
distinctions, in both synthetic and natural speech.
For example, Abramson (1962) imposed artificial
Fo movements on natural Thai monosyllables by
means of a vocoder and found that Fo
overpowered other concomitant features such as
duration and amplitude in cueing tonal
distinctions. The primacy of Fo was confirmed in a
later study using synthetic Thai speech
(Abramson, 1975), though addition of natural
amplitude contours improved identification
further. The conclusion that Fo carries sufficient
information for conveying tonal distinctions has
also been drawn by Howie (1976), Tseng (1981),
and M.-C. Lin (1987), who investigated the tones
of Mandarin Chinese. However, these studies
either did not vary duration and amplitude at all,
or they pitted these dimensions against
unambiguous Fo contours. In whispered
monosyllables, where Fo is altogether absent but
duration and amplitude differences may be
retained to some extent, tonal distinctions are
resolved poorly, though above chance level (e.g.,
Abramson, 1972; Howie, 1976). It is conceivable
that, in addition to increasing the naturalness of
utterances (cf. M.-C. Lin, 1987; Rumyantsev,
1987), duration and amplitude have larger effects
on tone identification when Fo provides
ambiguous information. Also, the relative
informativeness of different acoustic cues for tonal
identity may vary across languages.

The Fo dimension itself may be decomposed into
two aspects: height and movement.1 The relative
importance of these two aspects obviously depends
on the nature of the tonal distinction to be made:
If the distinction is between two tones differing
primarily in register (e.g., "high" vs. "low"), Fo
height will be important; if two tones differ
primarily in contour (e.g., "rising" vs. "falling"), Fo
movement will be the dominant cue. However, for
tones that, according to linguistic nomenclature,
differ only in register or in contour, the other
aspect of Fo might play a secondary role in cueing
the contrast. Furthermore, for tones that differ in
both register and contour (e.g., "high falling" vs.
"low rising"), both Fo height and movement may
be relevant, though perhaps not equally
important. Their relative importance may depend
on what other tones there are in the language.

With these issues in mind, we conducted the
present study to determine the relative
importance of Fo height, Fo movement, and
duration as cues to the tonal distinctions of
Taiwanese. From traditional classifications by
phonologists and from acoustic studies (Chiang,
1967; H.-B. Lin, 1988; Zee, 1978) it is evident that
Taiwanese has five long tones, whose typical Fo
contours are illustrated in Figure 1.2 They fall
into two classes: Tones 1 ("high level") and 5 ("mid
level") are fairly level or static, whereas tones 2
("high falling"), 3 ("mid falling"), and 4 ("low
rising") are contoured or dynamic. Two pairs of
tones, 1 vs. 5, and 2 vs. 3, have similar Fo
contours but differ in register. We would thus
expect Fo height to play a primary role in the
distinction of these tone pairs. As for the tonal
pairs with different Fo contours, we expected that
Fo movement rather than height would be
responsible for cueing the differences. However,
we wondered whether the "other" aspect of Fo
would make a contribution to a distinction as well.
For example, would the small difference in Fo
movement between the two level tones, 1 and 5, or
the larger one between the two falling tones, 2 and
3, play any role in perception at all? And would Fo
height be relevant to the distinction, for instance,
between tones 1 and 4, even though they differ in
contour? The answers to these questions seemed
less obvious.

With respect to duration, the five Taiwanese
long tones fall into two groups: relatively long
(tones 1, 4, and 5) and relatively short (tones 2
and 3). In isolated syllables, the respective
average durations were 145 ms and 75 ms (H.-B.
Lin, 1988; see Figure 1). Thus, in principle,
duration could be a rather strong cue for the
distinction between falling and nonfalling tones.




Figure 1. Average Fo movements of five Taiwanese tones
produced by three male speakers on the syllable /do/ in
sentence-final position. Horizontal axis: duration (ms).
(Data from H.-B. Lin, 1988.)

Our approach was to synthesize a variety of Fo 
patterns by varying Fo height, Fo movement and 
duration independently in isolated syllables.3 In
some of our stimuli, the Fo heights, movements, 
and durations of typical Taiwanese tones were 
juxtaposed in novel combinations, so the relative 
strength of these competing cues could be 
assessed. In addition, we synthesized Fo patterns 
intermediate between those of the original tones 
in terms of Fo height and/or movement. These 
relatively ambiguous stimuli provided the best 
opportunity to observe effects of secondary cues, 
such as duration or the "other" Fo dimension, on
perception of a given tonal contrast. Even though 
our stimuli were presented singly, we 
conceptualized the study in terms of pairwise 
tonal contrasts, for heuristic convenience. 
Listeners, of course, always had all five lexical 
alternatives in mind as they tried to identify our 
synthetic syllables. 

METHODS 

Materials 

The syllable /do/ was modelled after a natural
Taiwanese utterance on the software serial
formant synthesizer at Haskins Laboratories.
Since the Taiwanese /d/ is unaspirated and
voiceless (i.e., [t]), only a 10-ms release burst
preceded the onset of voicing. The onset
frequencies of the first three formants were
450 Hz, 1160 Hz, and 2400 Hz. After 70 ms, they
reached respective steady states of 560 Hz, 760
Hz, and 2000 Hz. The amplitude of the voicing
source was kept at a constant value. The extreme
Fo values and durations of the five basic tones are
shown in Table 1. The actual Fo movements
followed the natural models (cf. Figure 1) as
closely as possible.

Table 1. Average Fo onset and offset values (in Hz) and
durations (in ms) of the five synthetic tones on /do/. The
"pivot" indicates the turning point of the Fo movement
in tone 4.

Tone    onset    pivot    offset    duration
1       IM                129       145
2       150               96        75
3       109               80        75
4       102      94       105       145
5       113               107       145


To assess the relative importance of Fo height,
Fo movement, and duration cues in the perception
of tonal distinctions, these properties were varied
independently within each pairwise contrast.
Thus we synthesized, in addition to the original
tones, stimuli in which the Fo movement of tone X
was combined with the Fo height of tone Y, and
vice versa. This involved a translation of the
whole Fo movement up or down the linear
frequency axis, by adding a positive or negative
constant to all Fo values in the synthesis
specifications. We defined Fo height operationally
as the onset frequency of a tone.4 In addition, we
created an Fo movement intermediate between
the two original tonal contours by averaging their
Fo values, and we chose an intermediate Fo onset
frequency as well. Thus we had three Fo
movements (X, Y, and intermediate) and three Fo
heights (X, Y, and intermediate), all combinations
of which resulted in nine stimuli for any given
tonal contrast. The stimulus set for the tone 2-4
contrast is illustrated in Figure 2; tone 24 denotes
the stimulus intermediate between tones 2 and 4
in both Fo height and Fo movement.5
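
The construction of a nine-stimulus set can be sketched in a few lines of Python. This is only an illustration of the procedure described above, not the synthesis software that was used. The function names, the 11-point sampling of each contour, the straight-line contours drawn between the Table 1 values, and the placement of the tone 4 pivot at the temporal midpoint are all assumptions made for the example; the actual Fo movements followed the natural models.

```python
from itertools import product

N = 11  # samples per contour on a normalized time axis (assumption)


def line(start, end, n):
    """Return n evenly spaced values from start to end, inclusive."""
    return [start + (end - start) * i / (n - 1) for i in range(n)]


# Stylized contours built from the onset/pivot/offset values of Table 1.
# Putting the tone 4 pivot at the midpoint is an assumption.
tone2 = line(150, 96, N)
tone4 = line(102, 94, N // 2 + 1)[:-1] + line(94, 105, N - N // 2)


def intermediate(a, b):
    """Average two contours point by point (the intermediate movement)."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def shift_to_onset(contour, onset):
    """Translate a whole contour so that it starts at the given onset (Hz)."""
    delta = onset - contour[0]
    return [f + delta for f in contour]


def stimulus_grid(x, y):
    """Cross three Fo movements with three Fo heights: nine stimuli."""
    movements = {"X": x, "Y": y, "XY": intermediate(x, y)}
    heights = {"X": x[0], "Y": y[0], "XY": (x[0] + y[0]) / 2}
    return {
        (m, h): shift_to_onset(movements[m], heights[h])
        for m, h in product(movements, heights)
    }


grid = stimulus_grid(tone2, tone4)
```

In this sketch, `grid[("X", "Y")]` is the tone 2 movement transposed to the tone 4 onset, and `grid[("XY", "XY")]` corresponds to the stimulus labelled 24.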

Given five original tones, there were 10 pairs of 
tonal contrast. In four of these (2-3, 1-4, 1-5, 4-5), 
the durations of the two original tones were the 
same, and so all nine stimuli were synthesized at 
the same duration. In the six other contrasts (1-2, 
1-3, 2-4, 2-5, 3-4, 3-5), the two original tones had 
different durations (cf. Table 1). For these, we
synthesized the set of nine stimuli at three
different durations, the two original ones (75 and
145 ms) and one intermediate one (110 ms), which
resulted in 27 stimuli. When the duration of an
original Fo movement was changed, its onset and
offset frequencies were maintained, but the Fo
trajectory was stylized by linear interpolation
between the extreme values. In the case of tone 4, 
the location of the pivot was moved in proportion 
to the duration change, and two linear 
interpolations were performed. 
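
The duration manipulation can be illustrated with a short Python sketch, given here only as a reading of the description above. The function names are hypothetical, and the pivot time of 58 ms for tone 4 at its original 145-ms duration is an assumed value chosen for the example, since the pivot location is not reported in Table 1.

```python
def stylized_f0(t, duration, onset, offset, pivot=None):
    """Fo (Hz) at time t (ms) of a linearly stylized contour.

    pivot is an optional (time_ms, f0_hz) turning point; with a pivot,
    two linear interpolations are joined at that point.
    """
    if pivot is None:
        return onset + (offset - onset) * t / duration
    pivot_t, pivot_f = pivot
    if t <= pivot_t:
        return onset + (pivot_f - onset) * t / pivot_t
    return pivot_f + (offset - pivot_f) * (t - pivot_t) / (duration - pivot_t)


def rescale_pivot(pivot, old_duration, new_duration):
    """Move the pivot in time in proportion to the duration change."""
    pivot_t, pivot_f = pivot
    return (pivot_t * new_duration / old_duration, pivot_f)


# Tone 2 (onset 150 Hz, offset 96 Hz) lengthened from 75 ms to 145 ms:
# the end points are kept and the path between them is a straight line.
tone2_mid = stylized_f0(72.5, 145, 150, 96)

# Tone 4 (onset 102, pivot 94, offset 105 Hz) shortened from 145 ms to 75 ms.
# The 58-ms pivot time is hypothetical.
pivot_short = rescale_pivot((58, 94), 145, 75)
tone4_at_pivot = stylized_f0(pivot_short[0], 75, 102, 105, pivot_short)
tone4_at_end = stylized_f0(75, 75, 102, 105, pivot_short)
```

The sketch keeps the onset, pivot, and offset frequencies fixed and changes only the time axis, which is the property the stylization was meant to preserve.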



Figure 2. Example of a stimulus set for a particular tonal
contrast: tone 2 versus tone 4. The original (2 and 4) and
intermediate (24) Fo movements are shown by the heavy
lines; the other patterns were obtained by changing the
onset frequencies of these three Fo movements. Horizontal
axis: duration (%). (See also Footnote 5.)

In all, 6 x 27 + 4 x 9 = 198 stimuli were created
on the synthesizer, though a number of these were
identical or closely similar to each other.6 The
stimuli were recorded in five different
randomizations, each on a separate audio tape. On
each tape, there were six blocks of 33 stimuli,
separated by 8 sec of silence. There was a 3.5 sec
interstimulus interval within blocks. Before the
presentation of the test tapes, subjects had a
chance to familiarize themselves with the original
synthetic tones on the /do/ syllable. The order of
test tapes was counterbalanced across subjects,
and the experimental session took about 90
minutes.
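
The stimulus count can be re-derived from the tone durations in Table 1, as in the following Python sketch. It is offered only as an arithmetic check; the variable names are illustrative and not part of the original materials.

```python
from itertools import combinations

# Durations (ms) of the five synthetic tones, from Table 1.
durations_ms = {1: 145, 2: 75, 3: 75, 4: 145, 5: 145}

pairs = list(combinations(sorted(durations_ms), 2))
same = [p for p in pairs if durations_ms[p[0]] == durations_ms[p[1]]]
diff = [p for p in pairs if p not in same]

stimuli_per_set = 3 * 3          # three Fo movements x three Fo heights
duration_levels = [75, (75 + 145) // 2, 145]  # original, intermediate, original

total = (len(diff) * stimuli_per_set * len(duration_levels)
         + len(same) * stimuli_per_set)

blocks_per_tape = 6
stimuli_per_block = total // blocks_per_tape
trials_per_listener = 5 * total  # five randomizations, one per tape
```

With these values, four contrasts share a duration and six do not, which gives the 198 stimuli and the six blocks of 33 per tape stated above.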

Subjects 

Four male and four female listeners were 
recruited from University of Connecticut and Yale 
University graduate students and were paid for 
their participation. All were native speakers of 
Taiwanese and reported to have no history of 
speech or hearing disorder. Like all educated 
Taiwanese, they were also fluent in English and
Mandarin Chinese.7 

Procedure

The stimuli were presented binaurally over 
headphones at a comfortable listening level. 
Listeners were instructed to identify each /do/ 
syllable as (1) 'city,' (2) 'gambling,' (3) 'envy,' (4)
'map,' or (5) 'surname' by writing down the
number of the choice. The response choices were
listed on the answer sheet in Chinese characters. 
Subjects were told not to leave any blanks. 

RESULTS 

The results for the 10 tonal contrasts will be 
discussed in the following order: First, the two 
contrasts that have nominally identical contours 
but differ in register (1-5, 2-3); then, two contrasts 
that have similar Fo movements and differ in Fo 
height (4-5, 1-4); finally, the remaining six 
contrasts which have very dissimilar Fo 
movements (as well as differences in duration), 
grouped into three pairs exhibiting different 
magnitudes of Fo height difference (2-1, 2-5; 5-3, 
1-3; 3-4, 2-4). In naming the members of a tonal
pair, the one with the higher onset frequency is 
always named first (hence 2-1 and 5-3). 

Tables 2 to 11 show the response distributions 
for the stimuli pertaining to each of the 10 pairs of 
tonal contrast. In each table, stimuli are coded in
terms of Fo movement and height, with two-digit
numbers standing for the intermediate values on
these two dimensions (e.g., 15 is intermediate
between the original tones 1 and 5). To simplify
the tables, the results have been averaged across 
stimuli differing in duration; effects of stimulus 
duration will be discussed later. Thus, each 
stimulus set includes nine types. The average 
recognition rate for the five synthetic syllables 
modelling the original tones was 91% correct, 
which confirms that these stimuli were 
satisfactorily synthesized (cf. Footnote 3). 
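
The coding scheme used in the tables can be made explicit with a small Python sketch. The function name `stimulus_set` is hypothetical; the sketch simply enumerates the nine movement-by-height combinations for one contrast, using a two-digit code for the intermediate value on each dimension.

```python
def stimulus_set(tone_a, tone_b):
    """List the nine (movement, height) codes for the contrast tone_a vs. tone_b.

    Each dimension takes three values: the two original tones and the
    intermediate value, which is written as a two-digit code (e.g., "15").
    """
    levels = [str(tone_a), f"{tone_a}{tone_b}", str(tone_b)]
    return [(movement, height) for movement in levels for height in levels]


codes = stimulus_set(1, 5)
print(codes)
```

For the tone 1 vs. tone 5 contrast, the code `("1", "5")` is movement 1 at height 5, and `("15", "15")` is the stimulus intermediate on both dimensions.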

Tones with the same contour, differing in 
register 

Presumably, the more pronounced the difference
in an acoustic property between two tones, the
more important it will be as a cue to the tonal
distinction. If, moreover, differences along other
psychoacoustic dimensions are small, it will
emerge as the dominant cue. Thus, for pairs such
as tones 1 vs. 5 and 2 vs. 3, which nominally have
the same Fo contour but differ in register, Fo
height was expected to be the dominant cue.
However, tones 2 and 3, at least, also exhibit a
difference in Fo movement (see Figure 1), and we
wondered whether that dimension would
contribute to the perceptual distinction.



Tone 1 vs. tone 5. Tones 1 (high level) and 5 
(mid level) have almost identical, flat Fo 
movements; the difference between them is in Fo 
height. The data in Table 2 reveal that Fo height 
indeed plays a primary role in the perception of 
this distinction. Movement 1 with height 5 was 
identified predominantly as tone 5, whereas
movement 5 with height 1 was classified as tone 1. 
There were virtually no confusions with other 
tones. Did Fo movement have any cue value at all? 
The stimuli with intermediate Fo height, which 
should have been the most sensitive indicators of 
Fo movement effects, suggest a negative answer. 
However, movement 1 with height 5 received only
70% tone 5 responses, whereas the original
synthetic tone 5 received 90%. This difference
notwithstanding, the effect of Fo movement on the 
perception of this tonal distinction seems 
negligible simply because the difference between 
the tones is minimal on this dimension. 

Table 2. High level tone (1) vs. mid level tone (5).



Stimuli 




Responses (%)




Mov Hgt


1 


2 3 


4 


5 


1 1 


80 






20 


15 


57.5 




25 


4) 


5 


30 






70 


15 1 


80 




25 


17^ 


15 


615 






37J 


5 


25 


15 




95 


5 1 


90 




25 


7.5 


15 


515 


25 




45 


5 


15 


25 




90 



Tone 2 vs. tone 3. The falling tones 2 and 3
nominally have the same Fo contour, though Fo in
tone 2 falls somewhat more steeply than in tone 3.
The main difference between them is in Fo height.
The data in Table 3 confirm that Fo height is the
major cue for the distinction: Movement 2 with
height 3 was identified mostly as tone 3, and
movement 3 with height 2 as tone 2. However,
there was also some effect of Fo movement: At
each of the three Fo heights, more tone 3
responses were obtained when the movement
derived from tone 3 rather than from tone 2. Thus,
even though both tones are considered merely
"falling" in traditional phonological terminology,
there is in fact a perceptually relevant difference
in Fo movement between them. The difference in
Fo height, however, is clearly the dominant cue.






Table 3. High falling tone (2) vs. mid falling tone (3). 



Stimuli 
Mov Hgt


1 


2 


3 4 


5 


2 2 
23 
3 




9U 
915 

225 


73 

723 


5 


23 2 
23 
3 




925 
» 

173 


IS 

173 

80 


2S 
23 

23 


3 2 
23 
3 


75 


fS 
673 
73 


273 
323 
923 





Tones with similar contours, differing in
register 

The contour of tone 4, referred to here as
"rising," is actually a complex falling-rising or
dipping Fo movement with a relatively limited
range (cf. Figure 1). Because of that limited range,
it looks somewhat similar to the flat contours of
tones 1 and 5, though to the ear it may be quite
dissimilar. In contrasting tones 5 vs. 4 and 1 vs. 4,
which differ in both Fo register and contour, we
expected both dimensions of Fo to be perceptually
relevant. The question was which of them would
carry more weight.

Tone 4 vs. tone 5. Tones 4 (low rising) and 5
(mid level) are relatively similar in Fo movement
during the first half of their durations, but during
the second half tone 4 moves up and almost
merges with tone 5. There is also a difference
in Fo height, tone 4 having a lower onset
than tone 5. However, the data in Table 4
reveal Fo movement to be the primary cue:



Table 4. Mid level tone (5) vs. low rising tone (4).



Stimuli
Mov Hgt


1 


Responses (%) 
2 3 4 


5 


5 5 
54 
4 


5 

25 


5 


25 


90 
90 
97J 


54 5 
54 
4 




25 
25 


65 
70 
95 


315 
275 
5 


4 5 
54 
4 


25 




100 
975 
100 





High intelligibility of both tones was maintained
when their original Fo movements were presented
at uncharacteristic heights. An effect of Fo height
emerged only when Fo movement was
intermediate. Evidently, Fo movement is the
dominant cue to this distinction. This is also
suggested by the finding that tone 4 responses
predominated for the intermediate Fo movement:
Detection of even a slight "dip" was sufficient to
elicit tone 4 percepts.

Tone 1 vs. tone 4. Since tone 1 (high level) is
very similar to tone 5 in Fo movement but higher
in register, the Fo movement difference between
tones 1 and 4 is similar to that between
tones 5 and 4, just discussed; only the difference
in height is larger. Does this imply a larger role of
Fo height in cueing the distinction? Table 5
suggests an affirmative answer, but it is evident
that Fo movement is still the dominant cue:



Table 5. High level tone (1) vs. low rising tone (4).



Stimuli 
Mov Hgt


1 


Responses (%) 
2 3 


4 


5 


1 1 
14 
4 


815 
25 


25 2.5 


10 


175 

75 

85 


14 1 
14 
4 


75 
15 




215 
€75 
100 


25 
25 


4 1 
14 

A 


15 


25 


815 
100 


25 



Movement 4 with height 1 was still predominantly 
identified as tone 4, and movement 1 with height 
4 was identified as tone 5, which shares the 
contour with tone 1. A small effect of Fo height is 
evident with the original tone 4 movement: At a 
very high Fo, some tone 1 responses did occur. A 
large effect of Fo height was obtained for the 
intermediate movement stimuli, whose 
identification changed from tone 1 to tone 4 as the 
height was lowered. This shift is larger than that 
observed in Table 4, in accord with the larger 
height difference for the present contrast.

Tones with dissimilar contours

The falling Fo contour (tones 2 and 3) is 
acoustically and perceptually very dissimilar from 
the level and rising contours; it is also carried on a 
shorter syllable, though differences in duration 
will be ignored for the time being. Since Fo 
movement proved to be perceptually important
even for distinguishing tones with relatively 
similar contours, we certainly expected that tonal 
contrasts involving falling tones would be 
primarily cued by Fo movement, with secondary 
contributions of Fo height depending on the 
amount of the difference. In fact, it will be seen 
that, because for each falling tone and each level 
tone in Taiwanese there is a similar tone differing 
in register, stimuli with altered Fo height were 
often identified as tones other than those in the 
particular contrast under consideration.

Tone 2 vs. tone 1. Tones 2 (high falling) and 1
(high level), though they are both nominally
"high," do show a difference in onset frequency.
The most striking difference, however, is in their
Fo movements. As the data in Table 6 show, Fo
movement is indeed the dominant cue to the
contrast: Movement 2 with height 1 was still
mostly identified as tone 2, whereas movement 1
with height 2 was recognized as tone 1. The
stimuli with the intermediate Fo movement
showed some effects of Fo height on their
identification as tone 1, but the effect was such
that tone 1 responses increased as the height was
raised, even though the higher onset frequency
derived from tone 2; there was no effect of height
on tone 2 responses. The paradoxical effect of Fo
height derived from a tendency to identify the
intermediate Fo movement as tone 3, which
indeed has a contour intermediate between tones
2 and 1, and occasionally even as tone 5; both of
these tendencies to give mid-register tone
responses increased as Fo was lowered, at the
expense of tone 1 responses. Basically, Fo height
seems to be irrelevant to the perception of the tone
2 vs. 1 contrast.



Table 6. High falling tone (2) vs. high level tone (1). 



Stimuli






Responses (%) 




Mov Hgt


1 


2 


3 4 


5 


2 2 




96.6 


33 




21 






5 




1 




87J 


\25 




21 2 


25.8 


S5 


183 


08 


21 


\25 


61.6 


215 


33 


1 


6.6 


59.1 


25.8 


83 


\ 2 


96.6 




1j6 


1j6 


21 


983 






1j6 


1 


92J 




0.8 


6.6 



Tone 2 vs. tone 5. The difference in Fo
movement between tones 2 (high falling) and 5
(mid level) is the same as that between tones 2
and 1, just discussed, but the difference in Fo
height (onset) is much larger. The data presented
in Table 7 show that predictable confusions
occurred as Fo height was changed: Movement 2
with height 5 was often identified as tone 3, while
movement 5 with height 2 was clearly identified as tone 1.
Stimuli with intermediate movement were
predominantly identified as falling tones (tones 2
or 3, depending on Fo height), though at a very
high Fo (above the characteristic height of tone 1)
some tone 1 responses occurred. In general,
however, Fo movement remained the overriding
cue for this falling versus level tone distinction.



Table 7. High falling tone (2) vs. mid level tone (5).



Stimuli






Responses (%)






Mov Hgt 


1 


2 


3 


4 


5 


2 2 




96^ 


33 






25 




833 


14.1 


OS 


1j6 


5 




34.1 


60 




5.8 


2S 2 


275 


51.6 


17.5 


1j6 


1j6 


25 


41 


61/ 


283 


08 


5 


5 




325 


60.8 


1.6 


5 


5 2 


775 


1j6 


08 






2S 


86.6 




25 




108 


5 


6.6 


0.8 


5 




87.5 



Tone 5 vs. tone 3. Here we have another level
versus falling contrast, at a lower Fo height.
Tones 3 (mid falling) and 5 (mid level) originate at
almost the same frequency, so no effect of Fo
height was expected. The data in Table 8 confirm
that the dominant cue for the tone 5 versus tone 3
distinction is Fo movement, though the stimuli
with the intermediate movement do show a small
effect of Fo height. The copies of the original tones
were not identified very well in this pair of tones;
this reflects in part their inherent confusability
with tones having the same contour (tones 1 and
2, respectively) but, in addition, the neutralization
of duration cues and stylization of Fo movements
may have increased the number of confusions.
This general ambiguity may have made listeners
extra sensitive to small differences in Fo height.

Table 8. Mid level tone (5) vs. mid falling tone (3).



Stimuli






Responses (%)




Mov Hgt


1 


2 


3 


4 


5 


5 5 


llJS 


1j6 


33 


08 


823 


53 


41 




5^ 


1j6 


883 


3 


1.6 




73 


23 


883 


53 5 


1j6 


K) 


49.1 


1.6 


373 


53 




d6 


65^ 




273 


3 




75 


65.8 




26.6 


3 5 




233 


70^ 




5^ 


53 




15.8 


80 


08 


33 


3 




20 


673 




123 



Tone 1 vs. tone 3. Tones 1 (high level) and 3
(mid falling) not only differ in Fo movement, as do
tones 5 and 3, but also in Fo height. As can be
seen in Table 9, subjects' responses changed
considerably with changes both in Fo movement
and in Fo height. However, changes in height led
to predictable "confusions": Movement 1 (level)
with height 3 (mid) was largely identified as tone
5 (mid level), whereas movement 3 (falling) with
height 1 (high) was most often classified as tone 2
(high falling). These responses thus
merely reflect the fact that Fo height cues the tone
1 vs. tone 5 and tone 2 vs. tone 3 distinctions. The
intermediate movement stimuli, however, do
reveal a genuine effect of Fo height on the
contrast between tones 1 and 3: Tone 1 responses
decreased and tone 3 responses increased as
height was lowered, with an increase in tone 5
confusions in the middle. Thus it appears that
both Fo height and movement are important cues
for the distinction between tones 1 and 3, just as
we expected.



Table 9. High level tone (1) vs. mid falling tone (3).



Stimuli 






Responses (%) 






Mov Hgt


1 


2 


3 


4 


5 


1 1 


85 




1.6 


1.6 


11.6 


13 


633 




33 




333 


3 


5 


0.8 


6.6 


9.1 


783 


13 1 


31.6 




123 


0.3 


31.6 


13 


6l6 


12o 


283 




523 


3 




73 


56.6 


0.8 


35 


3 1 


33 


69.1 


24.1 




33 


13 




45 


50 




5 


3 




IJ 


81.6 







Tone 3 vs. tone 4. There is a striking difference
in Fo movement between tone 3 (mid falling) and
tone 4 (low rising). However, the difference in Fo
onset is small. The data in Table 10 confirm that
Fo movement is the primary cue for the
distinction between these tones: Small changes in
Fo height left the responses to the original Fo
movements unchanged. The stimuli with the
intermediate movement did show an effect of Fo
height, despite the relatively small physical
differences involved. However, the effect was not so
much on identification of these stimuli as tones 3 or 4
as on their identification as tone 5 (mid
level). Indeed, the intermediate movement was
relatively flat and thus could be mistaken for the
mid level tone when its height was raised. For the
tone 3 vs. tone 4 distinction, therefore, Fo height
seems to be of little importance.



Table 10. Mid falling tone (3) vs. low rising tone (4).



Stimuli






Responses (%) 




Mov Hgt


1 


2 


3 


4 


5 


3 : 




14.1 


80.8 




5 


34 




9.1 


85 




5^ 


4 




10 


833 




6.6 


34 3 


1.7 


1.7 


233 


5 


683 


34 


1.7 


08 


383 


233 


35.8 


4 


0.8 




373 


333 


283 


4 3 






33 


96.6 




3, 






5 


95 




4 






23 


973 





Tone 2 vs. tone 4. Tones 2 (high falling) and 4
(low rising) show the sharpest Fo movement
contrast of any tone pair, as well as the largest
difference in Fo height. As can be seen in Table
11, movement 2 with height 4 was identified as
tone 3 (mid falling), which is not surprising.
Interestingly, however, movement 4 with height 2
was identified as tone 1 (high level), even though
movement 4 with height 1 was not so identified
(see Table 5). Thus, the movement barrier
between tones 1 and 4 can be overcome by a
sufficient raising of Fo height. The intermediate
movement stimuli (somewhat falling with a dip)
were never labeled as tone 4, but were highly
ambiguous at a high Fo and perceived as tone 3 at
a low Fo. Clearly, both Fo movement and height
are important for the tone 2 versus tone 4
distinction, though the actual weights of these
cues are difficult to gauge because of the intrusion
of other responses.

Table 11. High falling tone (2) vs. low rising tone (4).



Stimuli 






Responses (%)




Mov Hgt 


1 


2 




A 
«& 


5 


2 2 


08 


S6 


A 1 






24 




723 


a 




23 


4 




11.6 


823 




5.8 


24 2 




283 


20.8 




133 


24 


08 


323 


44.1 




223 


4 




33 


86.6 




10 


4 2 


86.6 


1.6 


08 


108 




24 


73 




41 


883 




4 


08 




4.1 


95 





Overview of Fo effects 

In the preceding discussion of pairwise tonal
contrasts, stimuli included in one particular
subset were often also relevant to the perception
of other tones, as evidenced by the various
response intrusions. It is useful, therefore, to
survey the pattern of predominant identification
responses for the complete set of stimuli, still
disregarding variations in duration. Table 12
provides such an overview. In its columns, Fo
heights are arranged in terms of decreasing onset
frequencies, and in its rows Fo movements are
ordered in terms of decreasing differences between
the onset and offset frequencies (with one minor,
deliberate reversal). Each original height
(movement) occurred with 9 different movements
(heights), whereas each intermediate height
(movement) occurred with 3 different movements
(heights). For each height-movement combination,
Table 12 lists all tonal categories with more than
20% of responses, in rank order.
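
The rule used to fill each cell of Table 12 can be sketched as follows. This is a minimal Python illustration, not the original analysis; the function name `predominant` and the response distribution in the example are hypothetical and serve only to show the thresholding and ranking.

```python
def predominant(responses, threshold=20.0):
    """Return tone categories whose response percentage exceeds the threshold,
    ordered from most to least frequent.

    responses maps tone number -> percentage of identification responses.
    """
    kept = [(tone, pct) for tone, pct in responses.items() if pct > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [tone for tone, _ in kept]


# Hypothetical distribution for one height-movement combination.
example = {1: 57.5, 4: 2.5, 5: 40.0}
print(predominant(example))
```

With this illustrative distribution the cell would list tone 1 first and tone 5 second, while the 2.5% intrusion is dropped.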

The first five columns show that strongly falling
Fo movements were perceived as either tone 2 or
tone 3, depending on Fo height. The secondary
relevance of Fo movement to this distinction can
be seen in the fact that tone 3 responses increased
as the steepness of the Fo movement decreased.
The next four columns show that moderately
falling Fo movements were perceived as tone 3 or
tone 5 at the lower Fo onsets; stimuli with high
onsets were not well sampled here, but suggest
tone 1 responses with tone 2 as the second choice.
The next three columns show that shallow falling
Fo movements (with an onset-offset difference of 6
Hz or less) were invariably identified as level
tones: as tone 1 or as tone 5, depending on Fo
height. Finally, the last three columns show that,
when the curvature of tone 4 is imposed on a flat
Fo movement, the stimuli were mostly heard as
tone 4. This salient Fo movement cue was
overridden only by very high absolute Fo values,
which favored tone 1 percepts.



Table 12. Predominant response categories (with percentages exceeding 20%) for all combinations of Fo movements
and heights. Numbers in parentheses represent onset frequencies (for Fo height) and the difference between onset and
offset frequencies (for Fo movement).



Fo height Fo movement 

Type Onset (Hz) Type and onset-offset difference (Hz)







2 


23 


1 

25 


3 


12 




35 


D 


34 


5 


15 


1 


45 


14 


4 






(45) 


(413) 


(30) 


(29) 


(27J5> 


(25.5) 


(17.5) 


(15) 


(13) 


(6) 


(33) 


(1) 


(13) 


(-1) 


(-3) 


2 


(150) 


2 


2 


Zl 


23 


2.1 


1.23 








1 




1 






1 


12 


(140) 


2 


























25 


(131.5) 


2 




23 
























1 


(130) 


2 






23 


23 






U2 




1 


1 


1 




\A 


4 


23 


(129.5) 


2 


2 




23 






















(126) 


23 


























4 


15 


(121.5) 


















U 




\5 








13 


(119.5) 








32 








53 
















14 


(116) 






















5,1 




4^ 


4 


5 


(1110 


32 




32 


32 






3i 






5 


5 


5,1 


4^ 




4 


25 


(lil) 








3 






3i 






5 












3 


(109) 


32 


3 




3 






3i 


3i 


53 


5 




5 






4 


45 


(1C7.5) 






















4^ 




4 


34 


(105.5) 




























4 


4 


(102) 


3 






3 




3 






3,4^ 


5 




5 


4 


4 


4 

Duration as a cue

Finally, we consider the possible role of duration
as a cue to tonal distinctions. Because the two
falling tones, 2 and 3, are characterized by shorter
durations than the other tones (see Table 1), a
short stimulus duration was expected to be a cue
to the falling contour category. In addition,
shortening the duration of a falling Fo movement
increases its slope, which may further enhance the
"falling" percept. This may favor tone 2 over tone
3 responses for clearly falling contours,
considering the steeper slope of the tone 2
movement (cf. Figure 1).

The results were analyzed by comparing the
response percentages for the short (75 ms) and
long (145 ms) durations of each stimulus whose
duration was varied. (The intermediate durations
were not considered.) Table 13, which is arranged
in the same way as Table 12, lists all response
categories in which a change of more than 10%
occurred as stimulus duration was shortened.
(The largest change was 36%.) Plus signs indicate
increases, minus signs decreases, in order of
absolute magnitude. The changes are often
complementary for two tonal categories, with one
response increasing at the expense of another.
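
The comparison behind Table 13 can be sketched in Python as follows. This is an illustration under assumptions, not the original analysis: the function name `duration_changes` and the response percentages in the example are hypothetical. The sketch subtracts the long-duration percentages from the short-duration ones, keeps changes larger than 10 percentage points, and lists them with plus or minus signs in order of absolute magnitude.

```python
def duration_changes(long_pct, short_pct, threshold=10.0):
    """Return signed tone labels for response changes exceeding the threshold
    when the stimulus is shortened, largest absolute change first.

    long_pct and short_pct map tone number -> response percentage at the
    long (145 ms) and short (75 ms) durations.
    """
    tones = set(long_pct) | set(short_pct)
    changes = {t: short_pct.get(t, 0.0) - long_pct.get(t, 0.0) for t in tones}
    big = [(t, d) for t, d in changes.items() if abs(d) > threshold]
    big.sort(key=lambda item: abs(item[1]), reverse=True)
    return [f"{'+' if d > 0 else '-'}{t}" for t, d in big]


# Hypothetical percentages for one stimulus at the two durations.
long_example = {1: 5.0, 3: 40.0, 5: 55.0}
short_example = {1: 3.0, 3: 65.0, 5: 32.0}
print(duration_changes(long_example, short_example))
```

In this made-up case tone 3 responses rise by 25 points and tone 5 responses fall by 23 points, a complementary pattern of the kind described above; the 2-point change for tone 1 is below the threshold and is omitted.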

Our expectations were that falling tone (2 and 3)
responses would generally increase and level tone
(1 and 5) responses would decrease as a
consequence of stimulus shortening, but that for
strongly falling Fo movements tone 2 responses
might increase at the expense of tone 3 responses.
The overall pattern of changes supports the first
prediction: There were 11 increases in tone 2
responses versus 3 decreases, 13 increases in tone
3 responses versus 3 decreases, one increase in
tone 5 responses versus 12 decreases, and no
increase in tone 1 responses versus 3 decreases.
The second, more specific prediction was less well
supported: In the left half of Table 13, increases in
tone 2 responses are frequent, but there are also
some decreases, and changes in tone 3 responses
are inconsistent, though usually complementary
to the changes in tone 2 responses. On the whole,
it appears that duration did have a role as a
secondary cue for the falling-nonfalling tone
distinction.
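
The tally reported above can be summarized by pooling the counts for falling and level tones. The counts in the following Python sketch are those given in the text; the names `TALLY` and `group_totals` are ours and purely illustrative.

```python
# (increases, decreases) in responses per tone when stimuli were shortened,
# as reported in the text.
TALLY = {2: (11, 3), 3: (13, 3), 5: (1, 12), 1: (0, 3)}


def group_totals(tones):
    """Sum increases and decreases over a group of tone categories."""
    increases = sum(TALLY[t][0] for t in tones)
    decreases = sum(TALLY[t][1] for t in tones)
    return increases, decreases


falling = group_totals([2, 3])
level = group_totals([1, 5])
print("falling:", falling, "level:", level)
```

Pooled this way, falling-tone responses show 24 increases against 6 decreases, whereas level-tone responses show 1 increase against 15 decreases, in line with the first prediction.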

DISCUSSION 

From these results, it is quite clear that Fo is 
the most prominent perceptual cue for tonal 
contrasts in Taiwanese, as in all other tone 
languages studied so far. The present data further 
show that either dimension of Fo (height or
movement) can emerge as the dominant factor in 
cueing a particular tonal contrast, depending 
on the tonal patterns to be differentiated. 



Table 13. Response categories showing changes in excess of 10% following a change in stimulus duration from 145 to
75 ms. All relevant combinations of Fo heights and movements are shown.



Fo height Fo movement 

Type Onset (Hz) Type and onset-offset difference (Hz)







2 


25 


3 


12 




25 


13 


34 


5 


1 


4 






(45) 


(30) 


(29) 


(27 J) 


(25J) 


(17J) 


(15) 


(13) 


(6) 


! a) 


(-3) 


2 


(150) 








♦2 


•1,(W,.5 














12 


(140) 








♦2,3 
















25 


(131J) 
















-1 






1 


(130) 


^2 


♦2 


*2,.3 


>2 






-5,>2,*3 








23 


(129J) 






















2A 
15 


(126) 
(121J) 










>2,-5 












4.43 


D 
14 


(119J) 
(116) 














3 






•l.*5 




5 

35 
3 
45 


(113) 
(111) 
(109) 
(107J) 




+2 


^2 






.5,^3 
.5,^3 
.5,^3 


•5,>3 


-5*3 


-5 




4>3 


34 


(105J) 
















-5 








4 


(102) 










1 










♦3 

In general, the more pronounced an acoustic
difference between two tones is, the more likely it
will be an important cue in perceiving the
contrast. Thus, Fo height was found to be a main
cue in the distinction between tones with highly
similar contours but different registers, whereas
distinctions between tones with dissimilar
contours were mainly cued by Fo movement.

On the whole, Fo movement seems to be
perceptually more important than Fo height for
Taiwanese listeners. This is especially evident in
the contrasts between tones 1 and 4, and tones 5
and 4, which in principle could have depended on
Fo height. One reason for this finding may be that
Fo movement is a more stable dimension than Fo
height, which varies across speakers and is often
ambiguous when no contextual reference is
provided. (Note that tones 2 and 3, and tones 1
and 5, tend to be confused in isolated syllables.)
However, it has also been observed that linguistic
background can affect the perception of Fo
patterns (Gandour, 1978, 1983). Thus it could be
that Fo height and movement are given different 
weights in languages with different tonal 
inventories, even when similar tonal contrasts are 
being perceived. For instance, in Cantonese, Fo 
height is rather important because four out of six 
tones are relatively similar in Fo movement 
(Vance, 1977). On the other hand, all four tones of 
Mandarin are dissimilar in Fo movement (Howie, 
1976). Taiwanese represents an intermediate case. 
Gandour (1983) found that Cantonese speakers 
rely more heavily on Fo height, and less on Fo 
movement, than do speakers of Mandarin and 
Taiwanese when judging various Fo patterns. Our 
finding of the relative importance of Fo movement 
in the perception of Taiwanese tones is not 
inconsistent with Gandour's findings.

One purpose of perceptual studies such as the
present one is to go beyond linguistic
nomenclature, which omits acoustic detail, and
beyond phonetic studies, which describe but do not
establish the perceptual relevance of these
detailed aspects. Thus we have shown that the
Taiwanese high falling and mid falling tones,
which nominally differ only in register, also
exhibit some perceptually relevant differences in
Fo movement. We have also shown that the
striking difference in duration between falling and
nonfalling tones can provide distinctive
information in ambiguous cases. Bailey and
Summerfield (1980), in their thorough studies of
the perception of segmental phonetic distinctions,
have argued that any systematic difference in
acoustic properties between segments can be
shown to be perceptually relevant. This
generalization also seems to apply to the
perception of tonal distinctions.

*Language and Speech, 32, 25-44 (1989).

†Also Department of Linguistics, University of Connecticut,
Storrs. Now at the Lexington Center, Inc., Jackson Heights, New
York.

1Throughout this paper, we refer to Fo characteristics of
phonological tones as Fo register and contour, but to the
corresponding phonetic dimensions as Fo height and movement.
Register and contour thus are characterized in discrete terms
(high, mid, low; rising, level, falling), whereas height and
movement are continuously variable and are described in
acoustic terms. The term "register" is not intended to denote
changes in vocal register.

2According to traditional classification, Taiwanese has five long
tones and two short tones. The former occur in syllables ending
with vowels or nasals, whereas the latter occur in checked
syllables only (see Chiang, 1967). Because only an open syllable
was used as the carrier in the present study, short tones were
excluded from consideration.

3We did not vary amplitude characteristics of the stimuli, to keep
the design within bounds. To confirm that Taiwanese tones
could be identified in isolated syllables, tones produced on the
syllable /do/ by a native speaker were presented to native
listeners for identification in a pilot study. Although confusions
occurred between tones 2 and 3, and between tones 1 and 5,
average performance was 87% correct.

4As a result, variations in Fo height were restricted for the tone 3-5
and 4-5 contrasts. Alternatively, we could have chosen the Fo
midpoint or the average Fo as our measure of Fo height. In that
case, however, height variations would have been restricted for
the tone 1-2, 2-3, and 4-5 contrasts. Our choice of Fo onset
frequency as the relevant measure is consistent with
phonological terminology for tones, which usually mentions
onset register and direction of contour, such as "high falling".

5Figure 2 actually shows a tonal pair of different original
durations, whose Fo movements have been scaled to a common
duration. The figure is slightly inaccurate because it shows
curvilinear Fo movements for both original tones. In reality, after
two temporally contrasting tones had been scaled to a common
duration, one or both of the Fo movements were linear.

6For example, each original tone occurred four times because it
occurred in four different stimulus subsets (i.e., in contrast with
four other tones). The slightly different response percentages to
physically identical stimuli in different tables (below) derive
from the fact that each stimulus subset was treated separately in
the data analysis.

7Note that /do/ is not a possible syllable in Mandarin, so
second-language interference seemed unlikely.

Memory for the Words and Melodies of Songs*

Six experiments investigated two explanations for the integration effect in memory for
songs (Serafine et al., 1984, 1986). The integration effect is the finding that recognition of
the melody (or text) of a song is better in the presence of the text (or melody) with which it
had been heard originally than in the presence of a different text (or melody). One
explanation for this finding is the physical interaction hypothesis, which holds that one
component of a song exerts subtle but memorable physical changes on the other
component, making the latter different from what it would be with a different companion.
Experiments 1, 2, and 3 investigated the influence that words could exert on the subtle
musical character of a melody. A second explanation for the integration effect is the
association-by-contiguity hypothesis: that any two events experienced in close temporal
proximity may become connected in memory such that each acts as a recall cue for the
other. Experiments 4, 5, and 6 investigated the degree to which both successive and
simultaneous presentations of spoken text with hummed melody would give rise to
association of the two components. The results gave encouragement for both explanations
and are discussed in terms of the distinction between encoding specificity and independent
associative bonding.



Stimuli obviously have multiple features. Two 
examples are that ordinary objects have both color 
and shape and that songs have both melody and 
text. Questions about memory representations of 
these theoretically separable but seemingly
related components of a song — melody and text — 
motivated our earlier investigations (Serafine, 
Crowder, & Repp, 1984; Serafme, Davidson, 
Crowder, & Repp, 1986). We hypothesized mat a 
song might be represented in memory in three 
ways: (1) independent storage of components (the 
separate entities perceived and 8to^<^^^ so that 
memory for one is uninfluenced by the other); (2) 
hoiloiic storage (the two components sc thoroughly 
connected in perception and memory that one is 
remembered only in the presence of the other); 
and (3) integrated storage (the two components 

Thir reMsrch wai supported by NSF Grant GB 86 08344 to 
R. Crowder and by NICHD Grant HD01994 to Haskins 
Laboratories. The authors are grateftil for the assistance of 
William Flack in testing subjects and to Shari R. Speor for 
discussion of "earlier versions of this paper. 



related in memory such that one component is 
better recognized in the presence of the other than 
otherwise). The holisiic hypothesis is obviously 
false in the general case since people often 
recognize the melodies of familiar songs \ l^en 
they h<)ar them performed on solo instruments, or 
with unfamiliar verses. What this informal 
observation leaves open, however, is whether the 
memory representation consists of independent or 
integrated components. 

In earlier studies we reported evidence for what 
we called an integration effect in memory for 
melody and text. Using a recognition task, we 
found that melodies were better recognized when 
heard with the same words (as originally heard) 
than with different words, even when the different 
words fit the melody and were equally familiar to 
the subject. Similarly, we found that the words of
songs were better recognized in test songs 
containing the original melody than in those 
containing a different but equally familiar melody. 
The procedure we employed was as follows: 
Subjects heard a serial presentation of up to 24
unfamiliar folksong excerpts, each heard only
once. A recognition test followed immediately, in 
which subjects were typically asked to indicate, 
for each excerpt, whether they had heard exactly 
that melody (or text) before, ignoring the current 
text (or melody). The test excerpts consisted of old 
songs (exactly as heard in the presentation) and 
various types of new songs (for example, old 
melody with new words), including a type we 
termed mismatch songs — that is, an old melody 
with old words that had been paired with a
different melody in the original presentation. The 
critical comparison, of course, was between 
melody recognition when old songs were heard 
and when mismatch songs were heard, that is, 
when the melody was paired with its original 
companion as opposed to a different, but equally 
familiar, companion. This comparison, then,
avoided the potentially biasing effect that 
completely new, unfamiliar words would have on 
recognition of a truly remembered melody. What
we have termed the integration effect is the 
finding that both melody and text recognition 
were better in the case of old songs than in
mismatch songs. We concentrated on the 
facilitating effect of identical words on recognition 
of melodies because recognition of words was in
some cases almost at ceiling. 

The effect was robust. It was not eliminated by 
instructing the subjects, on their initial hearing, 
to focus attention on the melody (and ignore the 
words), nor was it eliminated by hearing a 
different singer on the recognition test than had 
sung in the original presentation (Serafine et al., 
1984). Moreover, the effect was not due to a 
particular experimental artifact, the potentially 
confusing effect of hearing the melody with
seemingly "wrong" words; the wrong words did not
make melody recognition suffer, as against an
appropriate baseline, but the right words
facilitated it (Serafine et al., 1986).

The integration was not accounted for by a 
semantic connotation imposed on the melody by 
the meaning of the words (Serafine et al., 1986)
because the integration effect was found even in 
songs employing nonsense syllables on 
presentation and test. A melody heard only once,
then, was better recognized in the presence of its 
original nonsense text than with different but 
equally familiar nonsense. This latter observation 
seems inconsistent with a meaning-interaction 
hypothesis that might have considerable intuitive 
appeal. 

In the present studies, we explore further the 
source of the integration effect. Two hypotheses,
not necessarily incompatible, are under test here,
the physical interaction hypothesis and the 
association-by-contiguity hypothesis. The first of
these asserts that when a song is sung, the words 
impose subtle effects upon the melody notes, 
slightly affecting their acoustic properties such as 
the onsets, durations, and offsets. We have termed 
these effects "submelodic" because they would
leave unaffected the pitches as they would be 
notated or conceived in composition. For example, 
some words might impose a staccato articulation 
and others a legato phrasing. If this hypothesis 
were correct, then a melody sung with one 
particular text would in fact be a somewhat
different melody than it would be when sung with
another text. It would not be surprising to find
that the melody was then better recognized with
the same words both times than with changed
words. A similar argument could be made for
texts. 

The association-by-contiguity hypothesis asserts 
that two events that occur in close temporal 
proximity (contiguously or simultaneously) tend to 
be associated in memory, though neither was 
necessarily changed by virtue of having entered 
into this association. If this hypothesis were 
correct, then in the limit text and melody would be 
just as well associated if they were experienced 
simultaneously but separately (e.g., words spoken 
and hummed melodies) as if they were given as a 
song. 

In the present research, the first three 
experiments addressed the submelodic hypothesis
(a special case of the physical-interaction
hypothesis in which words affect musical
properties of the melody) and the latter three
experiments addressed the association hypothesis.
All experiments employed our usual general
procedure: Subjects heard folksong excerpts
followed immediately by a melody recognition test
where test items contained controlled
combinations of song components. All experiments
used variations of the musical materials and
design described below. 

General Method 

Musical materials were based on 40 American 
folksongs (from Erdei, 1974; see Serafine et al.,
1984, 1986) which, in earlier experiments, we 
found were virtually all unfamiliar to our subjects. 
There were 20 pairs of song excerpts, each pair 
selected so that melodies and texts were 
interchangeable, having rhythmic compatibility. 
Figure 1 shows such a pair.

Interchangeability of melodies and texts was
crucial to the construction of test items in which a 
song contained a different melody or text than 
that heard originally in the presentation. Thus, 
each text contained a stress pattern suitable for 
either melody, and both texts within a pair 
contained the same number of syllables. The
exceptions were Song Pairs 11 and 17, where one
text was shorter by one syllable and required the
common "slur" across two tones (see "sleep" in
Figure 1, Melody B). Given interchangeable
components, each pair potentially yielded five
types of test items, examples of which are shown
in Figure 2:



[Figure 1: musical notation not reproduced. Two melodies (Melody A and Melody B)
are each shown with both texts: "When the train comes along, when the train comes
along" and "Hush-a-bye, don't you cry, go to sleep, little baby."]

Figure 1. Sample pair of songs with interchangeable texts. (Aa and Bb denote original
songs; Ab and Ba denote derivatives.)



[Figure 2: musical notation not reproduced. Panels: Sample Presentation Items;
Sample Test Items.]

Figure 2. Sample presentation and test items. (Code for test items: a = new melody/new
words, b = old melody/new words, c = new melody/old words, d = old melody/old words,
mismatched, e = old songs.)

The five types were (a) new melody/new words,
(b) old melody/new words, (c) new melody/old
words, (d) old melody/old words, mismatched, and
(e) old songs.

In addition, test items might consist of old or 
new melodies alone, that is, hummed melodies 
without words. The present experiments employed 
three to five types of test items, but in each case 
the critical comparison was between old songs and 
mismatch songs, where the latter items allowed us 
to test recognition of one component in the 
presence of a different component that had 
nevertheless been heard in the presentation and 
was equally familiar. 
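
The taxonomy of test items described above can be made concrete with a short sketch. It is illustrative only, not the procedure used to build the tapes: it assumes each song is coded as a (melody, text) pair of identifiers, and the function name `classify_test_item` and the use of `None` for a hummed item are hypothetical conventions introduced here.

```python
def classify_test_item(presented, melody, text):
    """Label a test item relative to the presentation list.

    presented: set of (melody_id, text_id) pairs heard at presentation.
    melody:    melody identifier of the test item.
    text:      text identifier, or None for a hummed melody (assumed coding).
    """
    old_melodies = {m for m, _ in presented}
    old_texts = {t for _, t in presented}

    if text is None:
        # Hummed items carry no words at all.
        return "old melody hummed" if melody in old_melodies else "new melody hummed"
    if (melody, text) in presented:
        return "old song"
    if melody in old_melodies and text in old_texts:
        # Both components are familiar but were heard in different songs.
        return "mismatch song"
    if melody in old_melodies:
        return "old melody/new words"
    if text in old_texts:
        return "new melody/old words"
    return "new melody/new words"


# Hypothetical presentation list: melody A sung with text a, melody B with text b.
presented = {("A", "a"), ("B", "b")}
labels = {
    pair: classify_test_item(presented, *pair)
    for pair in [("A", "a"), ("A", "b"), ("A", "c"), ("C", "a"), ("C", "c"), ("A", None)]
}
```

The sketch makes explicit why the mismatch item is the critical control: both of its components are as familiar as those of an old song, and only their pairing differs.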

The songs were sung in the alto range by the
second author, recorded onto a master tape and 
dubbed onto sets of experimental tapes with a 5- 
sec interval of silence between presentation items 
and a 10-sec response interval after each test
item. A silent metronome set at one beat per sec 
facilitated performance at an even tempo, and a 
piano tone (not heard by the subjects) ensured 
pitch accuracy at the start of each song. 

Subjective tempos across the songs were not
uniform, however, due to normal rhythmic and
metric variations (e.g., "double time"). The
presented songs ranged in total duration from
about 4 to about 10 sec, with a mean of 6.4 and
standard deviation of 2.01. All songs used C as the
tonic, although there were variations in mode
(Dorian, major, and minor) and starting tone.
Only slight alterations were made in the original
folk melodies or texts (e.g., "across" changed to
"cross"), in order to ensure rhythmic
interchangeability of materials. 

The same general design was used in all 
experiments. The presentation and test sequences 
always utilized the song pairs in the same order. 
On the presentation tapes, half the songs were 
melodies with their original folksong texts (type 
Aa in Figure 1) and half used the borrowed, 
interchangeable text (type Ab in Figure 1). Each 
mismatch item on the test tapes required two
songs in the presentation sequence (since the 
melody of one would be tested with the text of the 
other). Whenever two such songs occurred in the 
presentation, they followed one another 
immediately on the tape. Natural sources of 
variation among these songs include length, 
nature of the melody, tempo, and subject matter of
the text, to name only a few characteristics. These
factors were completely controlled, however, by
counterbalancing across different subject groups.



Experiment 1 

The aim of Experiment 1 was to test the 
submelodic hypothesis by employing, on the test 
tape, songs that contained different texts but the 
same phonetic and prosodic pattern as those 
employed on the original presentation. We derived 
phonetically similar texts by translating each text 
into a corresponding nonsense text, where vowels 
were left intact and consonants were changed to a 
reasonably close phonetic neighbor (/b/=/g/; /k/=/t/,
etc.). The presentation consisted of songs with 
nonsense texts, and the test consisted of obviously 
different but phonetically similar texts, that is, 
the real words. If the submelodic hypothesis were 
correct, the integration effect should be obtained. 
That is, a melody should be better recognized 
when it appears with words that are phonetically
similar to the nonsense words with which it was
originally heard than with words whose phonetic 
derivatives are equally familiar but had been 
heard originally with a different melody. 

Subjects heard a presentation of 24 songs with
nonsense texts followed by a 20-item melody 
recognition test containing four each of the 
following types of items: 
(a) "old songs" (old melody with real words that 
are phonetically similar to the nonsense text 
heard with that melody in the 
presentation);
(b) "mismatch songs" (old melody with real
words that are phonetically similar to a 
nonsense text heard with a different melody 
in the presentation); 

(c) new melody/"old words" (new melody with 
real words that are phonetically similar to a 
nonsense text heard in the presentation); 

(d) old melody hummed, (that is, without 
words); 

(e) new melody hummed. 

The main question was whether melody 
recognition would be better in the "old song" than 
in the "mismatch" condition. The hummed test 
items provided a baseline for melody recognition. 

Method 

Materials 

Using the songs described under General 
Method, each text was translated into 
phonetically similar nonsense using the rules
described in Experiment 1 of Serafine et al.
(1986). The following are examples of translated
texts written following English orthography:

Original: Cobbler, cobbler, make my shoe.

Nonsense: Togglue, togglue, nate nie choo.

Original: Cape Cod girls they have no combs. 

Nonsense: Tade top berf shey jaze mo tong. 

Design 

Five sets of presentation and test tapes were 
constructed, each administered to a different 
group of subjects. Presentations consisted of 24 
song excerpts with nonsense texts, and test 
sequences consisted of 20 items containing real
words, where each of the five types of items 
occurred 4 times. Across five subject groups, each 
presentation item was tested in each of the five 
conditions. For example, the first presentation
item was tested as an "old song" in one group, as a
"mismatch song" in another group, as an old
melody hummed in another group, and so on. 
Because each "mismatch song" required hearing 
two songs in the presentation, the presentation 
tapes consisted of 20 song excerpts plus 4 
additional ones for the "mismatch" condition. 
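
The counterbalancing scheme described above can be illustrated with a short sketch. The source does not say how conditions were rotated across groups, so the cyclic (Latin-square) rotation below is an assumption, as are the names `CONDITIONS` and `counterbalance`. The sketch also simplifies by treating every tested item as rotating through all five conditions.

```python
from collections import Counter

# Condition labels for Experiment 1, as listed in the text.
CONDITIONS = [
    "old song",
    "mismatch song",
    "new melody/old words",
    "old melody hummed",
    "new melody hummed",
]


def counterbalance(n_items, conditions):
    """Return one condition list per subject group.

    Assumed scheme: group g tests item i in condition (i + g) mod k,
    so each item meets every condition exactly once across the k groups.
    """
    k = len(conditions)
    return [
        [conditions[(item + group) % k] for item in range(n_items)]
        for group in range(k)
    ]


# 20 test items rotated across 5 groups.
schedule = counterbalance(20, CONDITIONS)
per_group_counts = [Counter(group) for group in schedule]
```

With 20 items and five conditions, each group receives four items per condition, which matches the test design stated in the text.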

Procedure 

Testing was conducted individually in a quiet 
laboratory with tapes heard over loudspeakers. 
Subjects were instructed to listen to a presentation
of 24 excerpts that would sound like
folksongs except that the texts had been changed 
to nonsense. They were told that their "memory 
would be tested later" but not informed that only 
melody recognition would be tested. The test 
immediately followed the presentation. Subjects 
were told that test items would consist of hummed 
melodies or songs with real words, but in all cases 
they were only to indicate whether they had
"heard that exact melody before, that is, just the
musical portion." Subjects indicated "yes" or "no" 
on the answer sheet and gave a confidence rating 
that ranged from 1 to 3. 

Subjects 

Twenty Yale undergraduates with unde- 
termined levels of musical training were equally 
divided among the five groups. 

Results and Discussion 

Yes/no responses with confidence ratings were 
translated into single scores with a theoretical 
range of 1 to 6, where 1 represents very confident 
no (did not hear melody) and 6 represents very 
confident yes (did hear melody). The mean ratings 
for "old song," "mismatch song," new melody "old 
words," old hummed melody, and new hummed 
melody, respectively, were 3.99, 4.44, 3.56, 4.00, 
and 2.92. An analysis of variance with subjects as
the sampling variable showed conditions to be a
significant source of variance, F(4,76) = 9.45, p <
.001. The Newman-Keuls procedure was used to 
identify which comparisons produced the 
significant overall effects. Evidence that melody 
recognition was above chance was provided by the 
fact that the mean rating for old melodies 
hummed (4.00) exceeds that for new hummed 
melodies (2.92), p < .01, as well as by the fact that the
rating of "mismatch songs" (4.44) was reliably 
greater than the condition with new melodies and 
"old words" (3.56). However, "mismatch songs" 
generated a higher mean rating (4.44) than did 
"old songs" (3.99), contrary to our expectations 
based on the submelodic interpretation of the 
integration effect. Thus, phonetically similar "old
song" texts did not enhance recognition for the 
original melodies. 
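
The conversion of yes/no responses and confidence ratings into a single 6-point score can be sketched as follows. The text fixes only the endpoints (1 = very confident no, 6 = very confident yes). The sketch assumes that a confidence rating of 3 means "most confident"; the source does not state the direction of the scale. The function names are hypothetical.

```python
def recognition_score(said_yes, confidence):
    """Map a yes/no response plus a 1-3 confidence rating onto a 1-6 scale.

    Assumption: confidence 3 is the most confident rating, so
    no/3 -> 1, no/2 -> 2, no/1 -> 3, yes/1 -> 4, yes/2 -> 5, yes/3 -> 6.
    """
    if confidence not in (1, 2, 3):
        raise ValueError("confidence must be 1, 2, or 3")
    return 3 + confidence if said_yes else 4 - confidence


def mean_rating(responses):
    """Average the 6-point scores for a list of (said_yes, confidence) pairs."""
    scores = [recognition_score(yes, conf) for yes, conf in responses]
    return sum(scores) / len(scores)


# Hypothetical responses from one subject in one condition.
example_mean = mean_rating([(True, 3), (True, 1), (False, 2), (False, 3)])
```

On this scale a mean near 3.5 indicates no discrimination, which is why the comparison of old and new hummed melodies serves as the check that recognition was above chance.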

In this experiment, then, the submelodic
hypothesis was not supported. No evidence
emerged that the nonsense words in the
presentation imposed effects on the melodies
which would allow the phonetically similar real
words to enhance melody recognition. As a test of 
the submelodic hypothesis, however, this 
experiment seemed in retrospect to have been 
compromised by the fact that subjects heard real 
words on the test after having heard nonsense in 
the presentation. Possibly the surprise of a full 
semantic experience after originally studying 
nonsense may have been distracting. In 
Experiments 2 and 3 we addressed this problem. 

Experiment 2 

One aim of the present experiment was to verify 
that the integration effect could be obtained with 
newly-constructed nonsense texts that would be 
necessary for Experiment 3, where the submelodic 
hypothesis was tested again. In an earlier study 
(Experiment 1 of Serafine et al., 1986) we had 
shown, as noted above, that the integration effect 
was robust with nonsense texts of the sort used in 
Experiment 1 of the present paper. In the present 
and following experiments, however, somewhat 
different rules were employed for the construction 
of nonsense texts, and we sought to verify that the 
resulting new materials would give rise to the 
integration effect. Following a presentation of 
song excerpts with nonsense words, a melody 
recognition test employed only three types of test 
items: 

(a) old songs (old melody, old nonsense words,
exactly as heard in the presentation);

(b) mismatch songs (old melody with old 
nonsense words that had been sung to a 
different melody in the presentation); 

(c) new melody/old words.

The critical comparison was that between the
old song and mismatch conditions. The main 
prediction was that melody recognition would be 
enhanced by the presence of the original old 
nonwords (in the old songs) over that obtained 
with the different but equally familiar nonwords 
(in the mismatch songs). 

Method 

Materials 

The design of Experiment 2 required only 
eighteen of the 20 song pairs we had available.
(Two pairs were omitted on the grounds that they
had proven at least somewhat familiar to some 
subjects in Experiment 1). Because, in Experiment 
3 reported below, two nonsense texts were 
required for each real text, it was necessary to 
employ new rules for translating real words into 
nonsense. The new rules, as follows, were similar
to, but more detailed than, those of Experiment 1
and allowed fewer deviations. For example, the
voiced/unvoiced distinction was preserved
across transformations: 

(1) Vowels remain the same, and the following 
vowel-liquid sequences are treated as intact: /er/
as in Mary, /ar/ as in far, /il/ as in will, /ɔr/ as in
lore, /ɔi/ as in boy, /aw/ as in how, /eil/ as in pail,
/al/ as in doll, /ɔl/ as in awl, /ʌm/ as in rum.

(2) Consonants are interchanged according to 
the following list of phonetic similarities. For 
example, if /b/ occurs in a real word, the two
corresponding nonsense words use /d/ and /g/,
respectively: 

/r/^=^ or /w/; /w/=/j/=/r/ or /I/ 
/p/=/t/=/k/

/17We/r-7h/or/s/;/e/=/f7=/J/or/s/;/h/=/3/ f7;/s/= 

/tr/=/kw/=/pl/

/pr/=/tw/=/kl/

/kr/=/tw/=/pl/ 

/st/=/sk/=/sp/ 

/sy=//w/=/fr/ 

/skw/=/str/=/spl/ 

/bl/=/dw/=/gr/

/sp/=/st/=/sk/ 

terminal /n/=terminal /m/=terminal /ŋ/

Special cases of translated vowel/consonant 
combination: 

/ɝl/=/ɝm/=/ɝn/ (e.g., girl = berm, dern)
/ir/=/il/=/in/ (e.g., here = seal, feen)

/ʌn/=/ʌm/=/ʌŋ/

terminal /ɔŋ/=/ɔn/=/ɔm/ (e.g., song = fawn, shawm)
terminal Ag/ = /mi/ = An/ 

(3) Interior /ər/ or /ən/ is treated as a vowel, but
in terminal position is interchanged as follows:
M^^n/^^V as in anger^ften^le. 

(4) A terminal /s/ or /z/, when a plural marker, 
may be retained (untranslated) if the resulting 
nonsense is too difficult to pronounce. 

(5) The sounds /tʃ/ and /dʒ/ are omitted from all
real texts because three suitable phonetic
correspondences do not exist. Thus, minor changes
in some real texts were made (e.g., Joe changed to
Moe, chase to run).

The following is an example of a translated text, 
written in the form (regular orthography) used by 
the singer: 

Original: Cobbler, cobbler make my shoe 
Nonsense 1: Poggrel, poggrel nate nie foo. 
Nonsense 2: Toddwen, toddwen lape lie thoo. 

Only one set of nonsense texts was used in this
experiment.

Design 

The design was comparable to that of 
Experiment 1. Three sets of presentation and test 
tapes were administered to different sets of 
subjects. Across the three subject groups, each 
presentation item was tested in each of the three 
conditions: old song, mismatch, and new 
melody/old words. The presentation tapes 
consisted of 24 songs, and test sequences consisted 
of 18 items, six each of the three conditions. 

Procedure 

The testing procedure was comparable to that of 
Experiment 1. Subjects were instructed to listen 
to a presentation of folksong excerpts with 
nonsense texts, were told that their "memory
would be tested later," and following the
presentation were given the melody recognition
test in which they were to indicate whether they
had "heard this exact melody before, that is, just
the musical portion." They were not told what 
types of items to expect on the test except that the 
nonsense folksongs would be similar to those on 
the presentation. 

Subjects 

Fifteen Yale undergraduates with undetermined 
levels of musical training were equally divided 
among the three groups.

Results and Discussion

As in Experiment 1, responses were translated 
into 6-point ratings where 1 represents very 
confident no (did not hear melody) and 6 
represents very confident yes (did hear melody). 
Mean ratings for the old song, mismatch, and new 
melody/old words conditions were 4.86, 3.63, and 
2.59 respectively. The results of two omnibus 
analyses of variance were significant: With 
subjects as the sampling variable, F(2,28)=42.66, 
p < .001, and with the 18 test items as the 
sampling variable, F(2,34)=41.76, p < .001. 
Newman-Keuls tests revealed that melody 
recognition was significantly better in the old song 
than in the mismatch condition, both across 
subjects (p < .01) and across items (p < .01). 
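
As a check on the reported statistics, the degrees of freedom can be reproduced with a small sketch. It assumes the analyses were one-way repeated-measures analyses of variance, with every sampling unit (subject or item) contributing a mean to every condition. The source does not state this explicitly, although the reported values are consistent with it. The function name is hypothetical.

```python
def repeated_measures_df(n_units, n_conditions):
    """Degrees of freedom for a one-way repeated-measures ANOVA.

    n_units:      number of sampling units (subjects or items).
    n_conditions: number of within-unit conditions.
    Returns (df_effect, df_error).
    """
    df_effect = n_conditions - 1
    df_error = (n_units - 1) * (n_conditions - 1)
    return df_effect, df_error


# Experiment 1: 20 subjects, 5 conditions.
exp1_subjects = repeated_measures_df(20, 5)
# Experiment 2: 15 subjects or 18 test items, 3 conditions.
exp2_subjects = repeated_measures_df(15, 3)
exp2_items = repeated_measures_df(18, 3)
```

Under that assumption the computed values agree with the reported F(4,76) for Experiment 1 and with F(2,28) and F(2,34) for Experiment 2.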

Thus the integration effect was confirmed with 
these new materials, verifying that the presence of 
original old words, even nonsense words devoid
of semantic meaning, facilitates melody
recognition over that obtained with the different
but equally familiar words in the mismatch
songs. Besides vindicating our new stimuli, the
results of Experiment 2 provide welcome
replication for one of our most important previous
results: In this new experiment, mean ratings for
the old song and mismatch conditions,
respectively, were 4.85 and 3.63; corresponding
means from Experiment 1 of Serafine et al. (1986)
were 4.47 and 3.76. 

Experiment 3 

The aim of Experiment 3 was to retest the 
submelodic hypothesis, which had not been 
confirmed in Experiment 1. To avoid the 
potentially distracting use of both nonsense (at 
presentation) and real words (at test), which may 
have influenced the outcome of Experiment 1, we 
employed two different sets of phonetically derived 
nonsense texts, based on the same real words 
(which were never used in this experiment). The 
presentation consisted of folksong excerpts with 
nonsense texts. The test consisted of folksong 
excerpts whose texts were phonetically similar to 
those in the presentation but nevertheless were, 
in all cases, different nonsense. (As in Experiment 
1, the phonetic derivative of an old song was called 
an "old song," etc.) Test items were of three types: 

(a) "old songs" (old melody with nonsense words
phonetically similar to the old nonsense
text);

(b) "mismatch songs" (old melody with nonsense
words phonetically similar to an old 
nonsense text from a different song in the 
presentation); 



(c) new melody/"old words" (new melody with
nonsense words phonetically similar to an 
old nonsense text). 

If the submelodic hypothesis were correct, that 
is, if words impose subtle and memorable effects 
upon their melodies, then a melody should be 
better recognized when it is heard with nonsense 
words that are phonetically related to the 
nonsense with which that melody was originally 
presented than when heard with nonsense that is
not phonetically related to the original. In other 
words, melody recognition in "old songs" should 
exceed that in "mismatch songs." 

Method 

Materials 

The materials were those described under
Experiment 2. Both sets of nonsense texts
(phonetic derivatives of the original folksong 
texts) were employed. 

Design 

The design was comparable to that of 
Experiment 2, except that the test items, instead 
of comprising old songs, mismatch songs, and new 
melodies with old words, used the phonetically 
derived "old songs," "mismatch songs," and new 
melodies with "old words," where our quotation 
marks indicate that exact repetition of the verbal 
texts between learning and test never occurred. As 
in earlier experiments, counterbalancing across 
subject groups was employed to control for
natural variations in the songs. The presentation 
consisted of 24 items and the tests consisted of 18 
items. 

After 12 of the 30 subjects had been tested, an
inadvertent error in the test tapes was detected. 
Two song pairs contained faulty material for the 
condition new melody/"old words," although the 
other two conditions were correct. Thus, scores for 
those 12 subjects were based on four (instead of 
six) items in the new melody/"old words"
condition. 

Procedure 

The procedure was analogous to that of 
Experiments 1 and 2. At test, subjects were told 
that the texts of songs may sound similar to or 
different from those heard before, but they were to 
attend only to the melody and indicate recognition
(yes or no) and a confidence rating on the answer 
sheet. 

Subjects 

Thirty adults with undetermined levels of 
musical training were paid to participate and 
were equally divided among the three groups.

Results and Discussion

As before, melody recognition ratings had a
possible range of from 1 to 6. Means for the "old
song," "mismatch," and new melody with "old
words" conditions were 4.26, 3.88, and 3.05,
respectively. With subjects as the sampling
variable, the result of an analysis of variance was
significant, F(2,58) = 28.18, p < .001, and
Newman-Keuls tests indicated that melody recognition
under the "old song" condition was significantly
better than that under the "mismatch" condition,
p<M. 

With items as the sampling variable, an 
analysis of variance was performed on means 
generated only by the 18 subjects who had 
completed all items in all conditions. Those 
means, for the "old song," "mismatch," and new 
melody "old words" conditions respectively, were 
4.10, 3.71, and 3.04. The main effect was
significant, F(2,34)=12.02, p < .001, and a priori 
comparisons involving only the first two 
conditions revealed significance at the .02 level. 
(The results of post hoc tests were not significant, 
however.) 

The results of the present experiment show that 
the integration effect is obtainable with 
phonetically similar nonsense used at test. One
plausible explanation for this result is that words 
exert specific, albeit subtle, and memorable effects 
on the melodies with which they are sung. These 
submelodic effects include the manner in which 
consonants (perhaps also vowels) affect the onset, 
duration, and offset of particular melody tones. In 
our view, it seems indisputable that words exert 
variable effects on melody tones, as can be easily
imagined, for example, in the case where two 
tones accompany the words "tip-top" as opposed to 
"ho-hum." 

We think it not an accident that the present 
experiment showed evidence favorable to the 
submelodic hypothesis, whereas earlier efforts 
with a similar experimental design did not. The 
rules for deriving phonetically-similar nonsense 
texts were more fastidious here than those used 
before: For example, in these new materials we 
respected the voiced/voiceless distinction more 
consistently than under the old rules. Consonants 
with stop closures were distributed equally in the 
original and derived versions, too. These 
distinctions are just the sort that would be 
expected to underlie a submelodic effect of words 
on music. 

Other interpretations of the integration effect, 
for example those to be considered below, might 
also be consistent with the evidence adduced here 



for the submelodic hypothesis. Comparing 
Experiments 2 and 3 of the present series, we note 
a smaller, and statistically weaker integration 
effect in the latter, with the derived nonsense 
words, than in the former, with the very same 
nonsense texts presented at learning and test. 
This is as it should be, by any commonsense view,
for no scheme for deriving "similar" phonetic texts
could possibly be as faithful a reinstatement as 
complete identity. On the other hand, we should 
not exaggerate the triumph of the submelodic 
hypothesis: At most, we can claim that we have 
shown conclusively that some such factor is 
operating somehow in our integration 
experiments, not that it is an answer as to the 
complete cause of the effect. 
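
The comparison across experiments can be made explicit by computing the size of the integration effect, taken here as the mean rating for old songs minus the mean rating for mismatch songs. This difference score is our own illustrative summary, not a statistic reported in the text. The means are those given in the Results sections for the subject analyses, and the names in the sketch are hypothetical.

```python
# Mean 6-point ratings as reported in the Results sections (subject analyses).
REPORTED_MEANS = {
    "Experiment 1": {"old": 3.99, "mismatch": 4.44},
    "Experiment 2": {"old": 4.86, "mismatch": 3.63},
    "Experiment 3": {"old": 4.26, "mismatch": 3.88},
}


def integration_effect(means):
    """Old-song mean minus mismatch mean, rounded to two decimals."""
    return round(means["old"] - means["mismatch"], 2)


effects = {name: integration_effect(m) for name, m in REPORTED_MEANS.items()}
```

A positive difference favors the original pairing. The difference is largest with identical nonsense texts (Experiment 2), smaller with phonetically derived nonsense (Experiment 3), and reversed in Experiment 1.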

Introduction to Experiments 4 through 6 

The remaining three experiments investigated 
the degree to which the melodies and texts might 
be associated in memory because of their close 
temporal proximity, as successive events in 
Experiments 4 and 5, and as simultaneous events 
in Experiment 6. These experiments address what
we referred to above as the association-by-contiguity
hypothesis. The term association, by
itself, may connote many things theoretically, 
such as rote learning, Pavlovian conditioning, and 
pre-cognitive, antediluvian mists of antiquity. 
However, its denotation is theoretically empty: It 
simply stands for an experimental fact, that 
events A and B stand in a particular empirical 
relationship because of their history of co- 
occurrence. The challenge for theory is to 
rationalize the circumstances necessary for that
association to be formed and the nature of the
bonding thereby achieved. Thus, our integration
result illustrates some form of association,
without doubt. The submelodic mechanism, for 
which we adduced some support in Experiments 1 
to 3, is not strictly an associative mechanism at 
all, but rather an effect of one element upon the 
nature of the other, namely that the occurrence of 
A with B changed the physical nature of B. We 
now ask whether the temporal contiguity of A and 
B, words and melody respectively, is a sufficient 
condition for their association when no possibility 
exists for an overt influence of one upon the 
physical integrity of the other, as with the 
submelodic mechanism. 

In considering the theory of associations we 
have relied upon the distinction in the respective 
psychologies of James Mill and of John Stuart Mill 
between mental compounding and mental 
chemistry (see Boring, 1957, chapter 12). In the
former case two components retain their
independent identities, yet are connected to one 
another. In the latter case the two components are 
themselves altered by each other's presence. Our 
concept of the association-by-contiguity of melody
and text is like that of mental compounding: the
melody and text are connected in memory, hence 
act as recall cues for each other, yet each is stored 
with its independent integrity intact. By contrast, 
the submelodic hypothesis in the first three 
experiments is consistent with a somewhat more 
chemical form of bonding, for a melody and text 
change each other physically when sung together 
in a song. A more purely mental chemistry could 
be an associative process in which, by co-
occurrence in the mind, the memory
representation of each is changed as against what 
it would have been without a particular 
companion. 

More recently, a similar distinction has been 
articulated by Horowitz and Manelis (1972), albeit 
with a linguistic orientation, for adjective-noun
phrases. They refer to a distinction between I-
Bonding (where I stands for individual or
independent) and J-Bonding (where J stands for 
Joint). The former, illustrated by the phrases deep-
chair or dark-wing, take their meaning as a
phrase from the meanings of the constituent 
words. The latter, illustrated by high-chair or 
right-wing, possesses idiomatic meaning that 
transcends the meanings of the several 
constituents. As Horowitz and Manelis remarked 
(p. 222), I-Bonding owes allegiance to the British 
empiricist philosophers and J-Bonding to the 
Gestalt tradition. Tulving's work on recognition 
failure in episodic memory (Tulving, 1983) 
illustrates the same properties as J-Bonding, 
wherein an element of an association can be only 
poorly recognized but can be well recalled given 
the original associate as a cue. In many ways we 
believe that these issues are raised in their most 
stark relief when the two constituents, such as 
words and melodies, are fundamentally different 
cognitive elements than when intraverbal 
associations are at stake. 

Experiment 4 

The present experiment investigated the 
concept of contiguity as a sufficient condition for 
association. It assessed the degree to which a text 
could serve as the retrieval cue for a melody, when 
the two had initially been heard in close temporal 
proximity (in this case successively), yet not as a 
proper song. Each component in Experiment 4 
was strictly independent physically: Texts were
spoken and melodies were hummed. Using a 
technique similar to those reported above, we gave 
subjects a serial presentation of spoken texts, each
followed by its corresponding melody, hummed,
and then each text-melody pair was followed by a
10-sec interval of silence during which subjects
were to "imagine" the song. A melody recognition
test followed in which true (sung) songs and 
hummed melodies were heard. If subjects had 
managed to imagine the songs, as instructed 
during original presentation, they should have 
behaved in the same way as subjects in our earlier
experiments, who had actually heard the songs. 
The test items were of five types (quotation marks 
indicate a deviation from what was heard in the 
presentation): 

(a) "old songs" (the text and melody of one pair
from the presentation were sung together as
a song in testing);

(b) "mismatch songs" (the text from one pair
and the melody from a different pair were 
sung together as a song); 

(c) "new melody/old words" (the text from one 
pair heard in the presentation was sung 
with a new melody); 

(d) old melody hummed (exactly as heard in the 
presentation except not preceded by words); 

(e) new melody hummed. 

The main question was whether the "old song" 
condition would generate better melody 
recognition than would the "mismatch" condition. 
Such an advantage could derive from either of two 
processes, corresponding to I- or J-Bonding in the 
terminology of Horowitz and Manelis (1972). If the 
simple contiguity hypothesis were correct, we 
would expect that melodies and texts presented in 
close succession would be connected in memory, 
hence could act as recall cues for one another. 
Thus, in a melody recognition test consisting of 
true (sung) songs, the melody should be better 
recognized when heard with the text with which it 
is connected than with a different (mismatched) 
but equally familiar text. Likewise, as we said 
above, if subjects are able to fuse the melodies and
text mentally, using the 10-sec "imagine" period as
they were instructed, to create a song-like memory 
representation, then, too, the old song condition 
would produce better melody recognition than the 
mismatch condition, as in the earlier experiments. 
Thus, the conditions of this experiment could not 
permit a choice between these two hypotheses. 
That would require consultation of still further 
experimental arrangements. However, a positive
outcome here would be a necessary, if insufficient,
condition for either hypothesis.

Method 

Materials 

The 20 pairs of folksong excerpts were used with 
their real (not nonsense) words. Tape recordings of
spoken texts and hummed melodies were made by 
the same female alto that was the singer in our 
earlier studies. Each spoken text generally 
followed the rhythm of its companion melody, 
consumed approximately the same amount of 
time, and had the character of poetic speech
rather than normal, conversational speech.

Design

The design was comparable to that of 
Experiment 1, with five sets of tapes used to
counterbalance the test conditions for each 
presentation item across five subject groups. The 
presentations consisted of 24 text-melody pairs, 
where the spoken text and hummed melody in 
each pair were separated by a one-sec interval and 
followed by 10 sec of silence. The test consisted of 
20 items, four each assigned to the five conditions. 
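
As an illustrative aside, the counterbalancing described above can be sketched in a few lines. This is a minimal sketch under assumed details, not the procedure actually used to build the tapes: the cyclic rotation rule, the function name `counterbalance`, and the treatment of every test slot as an interchangeable "item" are all assumptions introduced here for illustration.

```python
from collections import Counter

# Condition labels follow the five test-item types listed for Experiment 4.
CONDITIONS = [
    "old song",
    "mismatch song",
    "new melody/old words",
    "old melody hummed",
    "new melody hummed",
]


def counterbalance(n_items=20, conditions=CONDITIONS):
    """Return one item -> condition mapping per tape set.

    Hypothetical scheme: tape set t assigns item i to condition
    (i + t) mod k, so conditions rotate across the k tape sets.
    """
    k = len(conditions)
    return [
        {item: conditions[(item + tape) % k] for item in range(n_items)}
        for tape in range(k)
    ]


tapes = counterbalance()
# Show how many items the first tape set assigns to each condition.
print(Counter(tapes[0].values()))
```

With 20 items and five conditions, a rotation of this kind places four items in each condition on every tape set and passes each item through all five conditions across the five subject groups, which is the property the design requires.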
Procedure 

The procedure was generally the same as that of 
the earlier experiments, except that subjects were 
told they would hear spoken texts and hummed 
melodies from simple folksongs and that they 
were to use the 10-sec interval to "imagine the 
words and melody together as though someone 
were singing them" and to "sing the song in your
head." On the test, subjects were told to expect 
either true (sung) songs or hummed melodies and 
to indicate melody recognition as in the other 
experiments. 

Subjects 

Twenty-five adults with undetermined levels of
musical training were divided equally among the 
five groups. 

Results and Discussion

As in earlier experiments, melody recognition 
ratings had a theoretical range of 1 to 6. Mean
ratings respectively for the "old song," "mismatch
song," "old words/new melody," old melody
hummed, and new melody hummed were 4.24,
4.08, 3.63, 4.28, and 3.23. The result of an analysis
of variance with subjects as the sampling variable
gave a statistically significant main effect of
condition, F(4, 96) = 5.38, p < .001. Newman-Keuls
tests showed that melody recognition, on its own, 
was better than chance; there was a significant
difference in the baseline comparison between the 
means for old and new hummed melodies (4.28 
and 3.23 respectively), p < .05. This conclusion is
bolstered by the fact that the melodies of "old
songs" yielded a higher mean rating (4.24) than 
did "new melodies with old words" (3.63), p < .05. 
However, no integration effect for texts and
melodies occurred. That is, the mean rating for
"old songs" (4.24) was not significantly higher
than that for "mismatch songs" (4.08). There was,
then, no advantage for melody recognition
conferred by the presence of original words over
different but equally familiar words.
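
The degrees of freedom reported for the Experiment 4 analysis can be checked against the design. The sketch below assumes a one-way repeated-measures layout (25 subjects, each rated under all five test conditions); the function name `repeated_measures_df` is introduced here purely for illustration.

```python
def repeated_measures_df(n_subjects, n_conditions):
    """Degrees of freedom for a one-way repeated-measures ANOVA.

    Effect df  = k - 1
    Error df   = (n - 1) * (k - 1), the subjects-by-conditions term.
    """
    df_effect = n_conditions - 1
    df_error = (n_subjects - 1) * (n_conditions - 1)
    return df_effect, df_error


# Experiment 4: 25 subjects, 5 test conditions.
print(repeated_measures_df(25, 5))  # (4, 96)
```

Under that assumption the result agrees with the reported F(4, 96).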

Thus, we can report no evidence that a melody 
and text heard in succession are better recognized
later in each other's presence, even when
instructions had been given to imagine the two
presentation components as a song. Several
reasons could possibly underlie the failure to find
an association effect here. The problem may have
been in the generation process, for one thing:
Subjects may have been simply unable to imagine
the combined components as a unified song. Or,
the imagination instructions themselves may
somehow have served to distract the subjects from
encoding the two components even as "normal"
paired associates. Further, in this experiment an
unprecedented inconsistency existed between 
what was heard in input (successive spoken 
speech and hummed melodies) and what was 
tested for recognition in output (sung songs); this 
may have been distracting. And of course it is 
possible that spoken and hummed stimuli of the 
sort employed here are not conducive to either a 
process of song generation or of contiguous
associative bonding. 

A special circumstance introduced by spoken 
speech and hummed melodies is the introduction 
of elements in each stream that are foreign to the 
identity of these as components within normal 
songs: The prosody of normal speech necessarily 
introduces intonation gestures (phrasal 
declination for example) that would not be 
compatible with the sung version. The act of 
humming, likewise, cannot but introduce nasal
and vowel segments that might otherwise not 
reside in the spoken text. Therefore, our 
abstraction of the spoken and melodic streams in 
the methodology of this set of studies is not an 
absolutely neutral operation. As so often, negative 
results are ambiguous, but positive results (see 
below) speak with considerably greater force.

In our next experiment we focused on the
possibility that the instruction to generate songs,
at presentation, had backfired even to the extent
of preventing the formation of independent (I-
bonded) associations.

Experiment 5 

Experiment 5 differed from Experiment 4 in two
ways: First, no instructions for imagining songs 
were given at presentation. Second, test items 
contained the same format of successive 
components as had the presentation, that is, 
spoken words followed by hummed melodies. 
Abandoning the imagery instruction was intended
to determine whether a melody and text could 
become independently associated in memory if the 
two components had been in close temporal 
proximity. Having established baseline melody 
recognition in Experiment 4, we saw no need to 
repeat the two hummed-melody conditions. The 
melody recognition test consisted of successive 
text/melody pairs under the following three
conditions: 

(a) old songs (a spoken text followed by its
hummed melody, exactly as heard in the
presentation);³

(b) mismatch songs (a spoken text followed by 
the hummed melody of a different 
text/melody pair from presentation); 

(c) old text with a new melody that had not 
been heard in the presentation. 

Method 

Materials 

Eighteen of the 20 song pairs used in 
Experiment 4 were used in the present 
experiment. 

Design

The design was comparable to that of previous 
Experiments 2 and 3. Three sets of tapes 
counterbalanced the test conditions for each 
presentation item across three subject groups. The 
presentation consisted of 24 text/melody pairs, 
and the test consisted of 18 text/melody pairs, 6 
each of the three conditions.

Procedure 

The procedure was comparable to the earlier 
experiments. Subjects were told to expect pairs of
spoken texts and hummed melodies on the 
presentation and test. Melody recognition ratings 
were obtained as before. 

Subjects 

Fifteen adults with undetermined levels of 
musical training were equally divided among the 
three groups. 



Results and Discussion 

As before, melody recognition ratings had a 
theoretical range of 1 to 6. Means for the old song, 
mismatch, and old words/new melody conditions 
were 4.22, 4.12, and 2.98 respectively. Clearly, no 
significant difference emerged between the old 
song and mismatch conditions. In other words, 
hummed melodies were not better recognized 
when preceded by the same spoken text which had 
preceded that melody at presentation than when 
preceded by a different (yet equally familiar) text. 
Thus, no evidence was found that a link in 
memory is engendered by the successive 
presentation of independent texts and melodies. 
This leaves open the possibility that mental 
compounding is not the agency for the integration 
effect between melody and text reported in our 
earlier experiments. If not, then the submelodic
hypothesis remains the only explanation for the
effect with evidence in its favor. However, a
lingering question is whether a simultaneous
presentation of spoken text and hummed melody 
could give rise to an association in memory of the 
two components. This was addressed in the 
following experiment. 

Experiment 6 

The objective of this experiment was to assess 
the degree to which a simultaneous presentation 
of spoken words and hummed melody could give 
rise to an integration of the two such that the 
melody was recognized better in the presence of 
the text with which it had originally been
presented than in the presence of a different 
(equally familiar) text. This result would indicate
that our reasoning about independent associative 
bonding had been correct, in Experiments 4 and 5, 
but our realization of contiguity had been 
inadequate. 

The presentation episodes consisted of normal 
spoken texts and hummed melodies heard 
simultaneously and binaurally (but not 
dichotically). We refer to these simultaneous 
pairings as "spoken songs." The later recognition 
tests were of two types: Half the subjects heard
only spoken songs (as in presentation) and half 
heard true, sung songs. Again, no instruction for 
the generation of song-like representations was 
given. So the question at hand was whether an 
association between contiguous components, if it 
occurred, would influence melody recognition only 
if the test stimuli were like those of the 
presentation or whether that association's 
influence would extend also to the case of true 
songs.

The melody recognition tests for both the
(between-subject) conditions with spoken songs 
and true songs consisted of three within-subject 
conditions. As before, the critical comparison was 
that between old songs and mismatch songs. 

(a) old songs (same text and melody as was 
heard in the presentation); 

(b) mismatch songs (the text of one pair and the 
melody of a different pair heard in the 
presentation); 

(c) old words with new melody. 

Method 

Materials 

Eighteen of the 20 song pairs were used. 

A master tape, from which experimental tapes 
were dubbed, was prepared by the same alto
singer, as follows: Hummed melodies were first 
recorded in succession, each preceded by exactly 
four evenly spaced taps, also recorded onto the 
tape. The resulting signal was then fed into a 
second tape recorder at the same time that spoken
texts were recorded onto a second tape. The singer
listened to the hummed melodies from the first 
tape over headphones, using the four taps to fix 
the onset of the hummed melody, and then spoke 
the text along with the melody, recording both 
onto the second tape. Texts were generally spoken 
in the rhythm of the melody and also began and 
terminated in synchrony with it. When 
experimental tapes were dubbed from the master, 
the four taps were omitted. The test tapes 
employing true songs were the same as those 
employed previously with these materials. 

Design 

The design was exactly analogous to that of 
Experiment 2, except that two sets of test tapes
were constructed, each administered to a different 
group of subjects, one set with spoken songs and 
the other with true songs.

Procedure 

The procedure was comparable to that of the 
earlier experiments, with subjects told to expect 
the spoken texts of simple folksongs to be heard 
simultaneously with hummed melodies. At test, 
one group was told that items would be true, sung 
songs, and the other group that test items would
be similar to presentation items. In all cases, of 
course, instructions called for recognition based 
only on the melodies. 



Subjects 

Twenty-four adults with undetermined levels of 
musical training were equally divided between the
two test groups. 

Results and Discussion

Melody recognition ratings had a theoretical 
range of 1 to 6. Mean ratings for old songs,
mismatch songs, and old words/new melody 
respectively were 4.56, 4.04, and 3.33 when the 
test consisted of spoken songs (as heard in the 
presentation) and 4.35, 3.96, and 3.25 when the 
test consisted of true songs. Two mixed analyses of 
variance were performed with type of test (spoken 
vs. true songs) as a between-subjects variable and 
the same three conditions ("old song," "mismatch,"
and "new melody/old words") as a within-subjects
variable. With subjects as the sampling variable,
only the main effect of conditions was significant,
F(2, 44) = 21.68, p < .001; neither the type of test
main effect nor the interaction was significant.
The Newman-Keuls test indicated that combined
"old song" ratings for both groups (4.46) exceeded
that for "mismatch songs" (4.00), p < .05.
Similarly, with items as the sampling variable,
only differences among the three conditions were
significant, F(2, 68) = 13.65, p < .001. The Newman-
Keuls test again supported the difference between
"old song" and "mismatch" ratings, p < .05. 

Thus, by the reasoning of the fourth and fifth 
experiments here, true temporal contiguity was 
the necessary condition for observing our 
integration effect. The most straightforward 
interpretation of that result is that, in Experiment 
6, conditions were favorable for the formation of 
independent associative links between 
constituents that had not lost their individual 
identity. Close temporal proximity, as in 
Experiments 4 and 5, was apparently not enough.
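
The Experiment 6 statistics reported above (by subjects) can be checked arithmetically. The sketch below is an illustration under stated assumptions: the two test groups are taken to be of equal size (12 subjects each, as the Subjects section implies), so the combined mean is the simple average of the two group means, and the error term is the usual subjects-within-groups by conditions term of a mixed design. The function names are introduced here for illustration only.

```python
from decimal import Decimal, ROUND_HALF_UP


def mixed_anova_within_df(n_subjects, n_groups, n_conditions):
    """df for the within-subjects effect in a mixed (split-plot) ANOVA."""
    df_effect = n_conditions - 1
    df_error = (n_subjects - n_groups) * (n_conditions - 1)
    return df_effect, df_error


def pooled_mean(mean_a, mean_b):
    """Average two equal-n group means, rounded half-up to two decimals.

    Decimal arithmetic avoids binary floating-point surprises at .xx5.
    """
    mean = (Decimal(mean_a) + Decimal(mean_b)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# 24 subjects, 2 test groups (spoken vs. true songs), 3 conditions.
print(mixed_anova_within_df(24, 2, 3))   # (2, 44)
print(pooled_mean("4.56", "4.35"))       # 4.46  ("old song")
print(pooled_mean("4.04", "3.96"))       # 4.00  ("mismatch")
```

Both the degrees of freedom and the combined means agree with the values given in the text.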

Here, for the first time in this series, we may 
rule out the submelodic hypothesis, because the 
pairing manipulation cannot have had any 
substantial effect on the physical nature of each 
constituent.⁴ Likewise, the cognitive version of the
submelodic hypothesis (J-Bonding indicative of
what we have called "mental chemistry") received
no encouragement from these last three
experiments. What makes a difference is not
whether or not people try actively to integrate the 
melody and words in their minds, constructing an 
unheard song, but whether or not the two were 
strictly simultaneous.

Straightforward as this interpretation is, we are
not so naive as to believe that one absent
interaction from a single experiment can
overthrow ideas as important as J-Bonding or
Encoding Specificity. Besides the usual caution
that we need more converging evidence on this
point, it could be argued, albeit with some
considerable added assumptions, that all subjects,
in both groups of this experiment, left the
presentation sequence with self-generated songs
as memory representations. Hearing a hummed
melody at the same time as one hears a
rhythmically matching stream of words might
produce the experience of a song, whether the
subject is trying to generate this or not. This
would account for the integration effect among
subjects tested with real songs. To account for the 
same effect in subjects tested with spoken songs, 
we need only observe that for these people, the 
conditions of acquisition and testing were exactly 
the same, which could have outweighed the 
disadvantage produced by the need for these 
subjects to generate song representations at test,
as well as at acquisition. 

An automatic process generating song-like
representations from simultaneous, compatible 
verbal and musical streams would not be 
unexpected from a consideration of speech 
processing: In our ordinary lives, a simultaneous
mixture of this sort is the rule rather than the 
exception, because the segmental features (in 
words) are always overlaid upon supra-segmental 
features, including specifically variation in 
fundamental frequency. For this ecological reason, 
simultaneous variation in pitch might be assigned 
to the prosodic aspect of speech automatically, 
even when the listener "knows" the verbal and 
tonal messages are nearly independent, as in 
listening to songs, or spoken songs. These
considerations lead us to the design of future 
experiments better exploiting the tonal prosody 
and spoken message of integrated language 
communication. 

General Discussion 

As for the integration effect in song, our 
experiments in this and in the two previous 
papers (Serafine et al., 1984; Serafine et al., 1986) 
have guided our thinking in a number of ways.
First of all, and despite musicological folklore to 
the contrary, the meaning of words seems to have
a negligible role in the fact that melody and words 
of folksongs become stored in an integral fashion. 
Here again, in Experiment 2, the result withstood 
nonsense materials devoid of conventional 
meaning. 



Secondly, we have adduced statistically reliable 
support for the submelodic hypothesis, suggesting 
that particular words can change the musical line 
sufficiently to influence recognition of the 
melodies later. It is no wonder people are slow to 
realize that "Baa, Baa, Black Sheep" and
"Twinkle, Twinkle, Little Star" are words to the
same tune: they are not, musically, quite the
same tune, by virtue of the words to which each
has been set. 

Finally we have uncovered a number of factors 
that govern the size of the effect, some 
statistically reliable on their own and others not. 
In retrospect, it was perhaps misguided for us to 
have thought that a single factor would control 
the integration effect. Among the agencies for 
which evidence exists, we must include first 
temporal contiguity, as shown in Experiment 6 
here. Barring the unknown contribution of 
automatic fusion of text and melody in songs, 
hearing words and the melody at the same time
appears to affect their joint storage in the manner 
of paired associates. But we should not discard 
completely those factors uncovered by earlier 
experiments in this series as potential factors; 
even though they were not reliable on their own,
they did measurably affect the size of the effect. 
Among these, we count instructions to attend only 
to melodies rather than to the whole songs at 
presentation (Serafine et al., 1984). Similarly, in 
the same experiment, acoustic non-identity of 
presentation and test materials (different singers, 
respectively) had an effect in the direction that
would have been predicted (though not
significantly). Elsewhere (Serafine et al., 1986 and 
here, in Experiment 4) we found that melodies in 
the presence of the wrong words did indeed have a 
distracting effect on melody recognition, beyond 
the facilitation that the correct words had.

Putting all these factors together, we believe we 
know well how to arrange conditions so as to 
maximize, or minimize, the integration of words 
and melodies in recognition of songs. This 
laboratory control is not unsatisfactory as 
explanation, provided one gives up the goal of 
having only one crucial component.

FOOTNOTES

*Memory & Cognition, in press (in a shorter version).
†Yale University.

¹We are well aware that the dictionary meaning of the word
contiguity stipulates that the events in question be juxtaposed,
or adjacent in time, but not overlapping or coterminous. This
departs from usage of the term within psychology, where
successive and simultaneous arrangements are both considered
contiguous. In this paper we remain with this latter usage even
though the former might be more justifiable to some scholars.

²Throughout this paper, quotation marks on test item labels
indicate a deviation from the nomenclature described under
General Method. Here, for example, an "old song" is so
labeled because it is the real-word phonetic equivalent of an
old song and is not exactly what was heard in the presentation.

³The stimuli were, of course, in no sense true songs. However,
we retain the same terminology as used in the other 
experiments. 

⁴Certainly not in Experiments 4 and 5, where the two constituents
did not overlap in time. In Experiment 6, with simultaneous
contiguity, masking-like effects could have existed between the
melodies and texts. This perceptual interaction is not what we
mean by physical interaction, which could not have occurred in
any of these experiments.

Orthography and Phonology: The Psychological Reality of
Orthographic Depth*

The representation of meaning by words is the
basis of the human linguistic ability. Spoken 
words have an underlying phonologic structure 
that is formed by combining a small set of 
phonemes. The purpose of alphabetic
orthographies is to represent and convey these 
phonologic structures in a graphic form. Just as 
languages differ one from the other, orthographic 
systems represent the various languages' 
phonologies in different ways. This diversity has 
been a source of interest for both linguists and 
psychologists. However, while linguistic inquiry 
aims to explain and describe the origins and
characteristics of different orthographies, 
psychological investigation aims to examine the 
possible effects of these characteristics on human 
performance. Consequently, reading research is 
often concerned with the question of what is
universal in the reading process across diverse 
languages, and what aspects of reading are unique 
to each language's orthographic system. My first 
objective in this chapter is to outline the 
properties of different alphabetic systems that 
might affect visual word processing. The second 
objective is to provide some empirical evidence to 
support the claim that reading processes are 
determined in part by the language's orthography. 

Orthography, phonology and the mental 
lexicon 

The purpose of orthographies is to designate 
specific lexical candidates. There is, however, 
some disagreement as to how exactly this purpose
is achieved. The major discussions revolve around
the role of phonology in the process of visual word 
recognition. Clearly, phonologic knowledge of 
words generally precedes orthographic knowledge; 
we are able to recognize many spoken words long 
before we are able to read them. Only later, in the 
process of learning to read, does the beginning 
reader master an orthographic system based, in 
western languages, on alphabetic principles.

The recognition of a printed word is based on a
match between a letter string and a lexical 
representation. This match allows the reader 
access to the mental lexicon. However, since 
lexical access can theoretically be mediated by two 
types of abstract codes: orthographic and 
phonologic, a question remains about the exact 
transform of the printed word that is used in the 
process of visual word recognition: Is it 
informationally orthographic or phonologic? 

One account argues that access to the mental 
lexicon is mainly phonologic (e.g., Liberman, 
Liberman, Mattingly, & Shankweiler, 1980). 
According to this view, orthographic information 
is typically recoded into phonologic information at
a very early stage of print processing. Thus, the 
lexical access code for printed word perception is 
similar to that for spoken word perception. The 
appeal of this model is its parsimony and
efficiency of storage; the reader does not need to 
build a visually coded grapheme-based lexicon, 
one that matches each of the words to spelling 
patterns in the language. Instead, a relatively
small amount of information — knowledge of 
grapheme to phoneme correspondences — can 
recode print into a form every reader already 
knows: the speech-related phonologic form. 

The second approach argues for the existence of 
an orthographic lexicon in addition to the
phonologic one. According to this alternative view,
lexical access for print can be achieved through
either system. The extreme position of this
approach holds that lexical access is typically
based only on the visual (orthographic)
information, and the word's phonology is retrieved 
after lexical access has occurred. Possible 
exceptions are novel or low-frequency words that 
may lack an entry in the visually based lexicon
(Seidenberg, Waters, & Barnes, 1984; Seidenberg,
1986). The appeal of such models is that visual
lexical access is direct and, presumably, faster 
without the need for a mediating phonologic 
recoding. However, a model based on visual lexical 
representations must assume the existence of a 
memory store of orthographically coded words
that parallels, in orthographic coding, most of the 
information the reader already possesses as 
phonologic knowledge. 

Clearly, the reader is well aware of both 
orthographic and phonologic structures of a 
printed word. Hence, the debate concerning 
orthographic and phonologic coding is merely a 
debate about priority: is phonology necessary for 
printed word recognition to occur, or is it just an 
epiphenomenon that results from it? In other 
words: is phonology derived pre-lexically from the 
printed letters and serves as the reader's code for 
lexical search, or, rather, is lexical search based 
on the word's orthographic structure while 
phonology is derived post-lexically? 

This question is often approached by monitoring 
and comparing subjects' responses in the lexical
decision and the naming tasks. In lexical decision 
the subject is required to decide whether a letter 
string is a valid word or not, while in naming he is 
required to read the letter string aloud. In both 
tasks reaction times and error rates are measures 
of subjects' performance. Note that lexical 
decisions can be based on the recognition of either 
the orthographic or the phonologic structure of the 
printed word. In contrast, naming requires 
explicitly the retrieval of the printed word's 
phonology. Phonology, however, can be generated 
either pre-lexically by converting the letters into
phonemes, or post-lexically by accessing the 
mental lexicon through the word's complete 
orthographic structure, and retrieving from the 
lexicon the phonologic information. 

Since, at least theoretically, these two 
alternative processes are available to the reader,
one should compare their relative efficiency. It has
been suggested that the ability to rapidly generate
pre-lexical phonology depends primarily on the
reader's fluency, task characteristics, and the
printed stimuli's complexity (see McCusker,
Hillinger, and Bias (1981), for a review). In our
present context, only the factor of stimulus
complexity is of special interest. Complexity is
generally related to the amount of effort needed 
for decoding a given word. One possible source of 
complexity that merits close examination is the 
lack of transparent correspondence between 
orthographic and phonologic subunits. Because 
the purpose of orthographic systems is the 
representation of phonology, whether the skilled 
reader uses this information or not, the relative 
directness and simplicity (the transparency) of
this representation can be of major importance.

Orthographic depth: Evidence from the
shallow Serbo-Croatian 

Although the transparency between spelling and
phonology varies within orthographies, it varies
more widely between orthographies. The source of
this variance can often be attributed to
morphological factors. In some languages (e.g.,
English), morphological variations are captured by
phonologic variations. The orthography, however,
was designed to preserve primarily morphologic
information. Consequently, in many cases, similar
spellings denote the same morpheme but different
phonologic forms: the same letter can represent
different phonemes when it is in different
contexts, and the same phoneme can be
represented by different letters. The words "heal"
and "health", for example, are similarly spelled
because they are morphologically related.
However, since in this case a morphologic
derivation resulted in a phonologic variation, the
cluster "ea" represents both the sounds [i] and [ɛ].

Within this context English is often compared to
Serbo-Croatian. In Serbo-Croatian (aside from
minor changes in stress patterns), phonology
almost never varies with morphologic derivations.
Consequently, the orthography was designed to
represent directly the surface phonology of the
language: Each letter denotes only one phoneme,
and each phoneme is represented by only one
letter. Thus, alphabetic orthographies can be
classified according to the transparency of their
letter-to-phonology correspondence. This factor is
usually referred to as "orthographic depth"
(Klima, 1972; Liberman et al., 1980; Lukatela,
Popadić, Ognjenović, & Turvey, 1980; Katz &
Feldman, 1981). An orthography that represents
its phonology in an unequivocal manner is 
considered shallow, while in a deep orthography 
the relation of orthography to phonology is more 
opaque. 

Katz and Feldman (1981) suggested that the
kind of code that is used for lexical access depends
on the kind of alphabetic orthography facing the
reader. Shallow orthographies can easily support
a reading process that uses the language's surface
phonology. On the other hand, in deep
orthographies, the reader is encouraged to process
printed words by referring to their morphology via
their visual-orthographic structure. Note that
orthographic depth does not necessarily have to
have a clear psychological reality. For example, it
has been argued that visual-orthographic access is
faster and more direct than phonologic access
(e.g., Baron & Strawson, 1976). By this argument,
it might be the case that in all orthographies
words can be accessed easily by recognizing their
orthographic structures visually. Therefore, the
relation between spelling and phonology should
not necessarily affect subjects' performance.

Most of the earlier studies in word recognition 
were conducted with English materials. But in 
order to validate the psychological reality of
orthographic depth, experimenters turned to
shallower orthographies like Serbo-Croatian. 

In addition to its direct spelling-to-phonology
correspondence, the Serbo-Croatian orthography
has another important feature: It uses either
the Cyrillic or the Roman letters, and the reader is
equally familiar with both sets of characters. Most
characters are unique to one alphabet or the
other, but there are some characters that occur in
both. Of these, some receive the same phonemic
interpretation regardless of alphabet. These are
called COMMON letters. Others receive a
different interpretation in each alphabet. These
are known as AMBIGUOUS letters. Letter strings
that include unique letters can be read in only one
alphabet. Similarly, letter strings composed
exclusively of common letters can be read in only
one way. By contrast, strings composed only of
AMBIGUOUS and COMMON letters are bivalent.
They can be read in one way by treating the
characters as Roman graphemes and in a distinctly
different way by treating them as Cyrillic
graphemes. The two alphabets are presented in
Figure 1. This specific feature of the
Serbo-Croatian orthography was used in several
studies in order to examine phonological processing
in visual word recognition (Lukatela et al., 1980;
Feldman & Turvey, 1983).
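
The classification of letter strings described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the cited studies: the function name `classify_string` is hypothetical, and the two letter sets are an assumption based on standard descriptions of the uppercase Serbo-Croatian alphabets (Figure 1 remains the authoritative source). Letters are treated as printed shapes, so a single character stands for a glyph shared by both alphabets.

```python
# Assumed glyph sets (uppercase), taken from standard descriptions of the
# two alphabets rather than from the text above.
COMMON = set("AEOJKMT")   # same phonemic value in Roman and Cyrillic
AMBIGUOUS = set("HPCB")   # different phonemic value in each alphabet


def classify_string(letters: str) -> str:
    """Return 'unique', 'common', or 'bivalent' for a printed letter string."""
    glyphs = set(letters.upper())
    if glyphs - (COMMON | AMBIGUOUS):
        # At least one glyph belongs to a single alphabet: one reading only.
        return "unique"
    if glyphs & AMBIGUOUS:
        # Only shared glyphs, some with two phonemic values: two readings.
        return "bivalent"
    # Only shared glyphs that have the same value in both alphabets.
    return "common"


for s in ("KLOZET", "MAMA", "BETAP"):
    print(s, classify_string(s))
```

Under these assumptions a string such as BETAP contains only shared glyphs, two of which (B and P) are ambiguous, so it is classified as bivalent.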

Lukatela et al. (1980) investigated lexical
decision performance in Serbo-Croatian, for words
printed in the Cyrillic and the Roman alphabets.
They demonstrated that words that could be read
in two different ways were accepted more slowly
as words than words that could be read in one
way. Thus, the fact that one orthographic form
had two phonologic interpretations slowed
subjects' reaction times. This outcome suggested
that the subjects were sensitive to the phonologic
structure of the printed stimuli while making
lexical decisions. Lukatela et al. concluded that
lexical decisions in Serbo-Croatian are necessarily
based on the extraction of phonology from print.
Similar results were found by Feldman and
Turvey (1983), who compared phonologically
ambiguous and phonologically unequivocal forms
of the same lexical items. They suggested
that the direct correspondence of spelling to 
phonology in Serbo-Croatian results in an 
obligatory phonologic analysis of the printed word 
that determines lexical access. Moreover, in 
contrast to data obtained in English, the skilled 
reader of Serbo-Croatian demonstrates a bias 
towards a phonologically analytic strategy. 

[Figure 1: Serbo-Croatian alphabet, uppercase. Columns: uniquely Cyrillic letters, common letters, ambiguous letters, uniquely Roman letters.]

Figure 1.

Evidence from the deeper Hebrew
orthography

The term "orthographic depth" has been used 
with a variety of related but different meanings. 
Frost, Katz, and Bentin (1987) suggested that it
can be regarded as a continuum on which 
languages can be arrayed. They proposed that the 
Hebrew orthography could be positioned at the 
extreme end of this continuum, since it represents 
the phonology in an ambiguous manner. 

Hebrew, like other Semitic languages, is based
on word families that are derived from
tri-consonant roots. Therefore, many words share an
identical letter configuration. The orthography
was designed primarily to convey to the reader the
word's morphologic origin. Hence, the letters in
Hebrew represent mainly consonants, while the
vowels are conveyed by diacritical marks
presented beneath the letters. The vowel marks,
however, are omitted from regular reading
material, and can be found only in poetry,
children's literature, or religious scripts (for a
detailed description of the Hebrew orthography see
Navon & Shimron, 1984). When the vowels are
absent, a single printed consonantal string usually
represents several different spoken words
(sometimes up to seven or eight words can be
represented by a single letter string). The Hebrew
reader is, therefore, regularly exposed to both
phonologic and semantic ambiguity. An
illustration of the Hebrew ambiguous unvoweled
print is presented in Figure 2.

Although it is clear that the Hebrew 
orthography is an example of a very deep 
orthography, this is for different reasons than 
those presented in the context of the English vs. 
Serbo-Croatian distinction. English is labeled as 
deep because of the opaque correspondence 
between single graphemes and phonemes in the 
language's spelling system. In contrast, this 
correspondence is fairly clear in Hebrew, since the 
consonants presented in print, aside from a few 
exceptions, correspond to only one phoneme. 
However, because the vowels are absent, the 
Hebrew orthography conveys less phonologic 
information than many other orthographies. 
Hence, it is not just ambiguous, it is incomplete. 
This characteristic of Hebrew, as I will argue, is
not only linguistic but also psychological, in that it
provides a possible explanation of differences in
reading performance revealed in this language.

In order to assign a correct vowel configuration
to the printed consonants to form a valid word, the
reader of Hebrew has to draw upon his lexical
knowledge. The choice among the possible lexical 
alternatives is usually based on contextual 
information: the semantic and syntactic contexts 
constrain the possible vowel interpretations. For 
an unvoweled word in isolation, however, the 
reader cannot rely on contextual information for 
the process of disambiguation. 

Several studies have examined reading 
processes of isolated Hebrew words. Bentin, 
Bargai, and Katz (1984) examined naming and 
lexical decision for unvoweled consonantal strings. 
Some of these strings could be read as more than 
one word while some could be read as one word 
only. The results demonstrated that naming of 
phonologically ambiguous strings was slower than 
naming of unambiguous ones. In contrast, no 
effect of ambiguity was found in the lexical 
decision task. These results suggest that the 
reader is indeed sensitive to the phonologic 
structure of the orthographic string when naming 
is required. Lexical decisions in Hebrew, by
contrast, are not based on a detailed phonological
analysis of the printed word. Note that this outcome
is in sharp contrast to the results obtained in the 
shallow Serbo-Croatian. 



[Figure 2: the unvoweled form, its voweled forms, their phonemic transcriptions, and their meanings.]

Figure 2. Unvoweled and voweled forms of the Hebrew tri-consonantal root ספר (sfr).

Lexical decisions and naming of isolated Hebrew
words were further investigated in a study by
Bentin and Frost (1987). In this study subjects
were presented with phonemically and
semantically ambiguous consonantal strings. Each
of the ambiguous strings could have been read
either as a high-frequency word or as a
low-frequency word, depending upon different vowel
assignments. Decision latency for the unvoweled
consonantal string was compared to the latencies
for both the high- and the low-frequency voweled
words. The results showed that lexical decisions
for the unvoweled ambiguous strings were faster
than lexical decisions for either of their voweled
(therefore disambiguated) alternatives. This
outcome was interpreted as evidence that lexical
decisions for Hebrew unvoweled words were given
prior to the process of phonological
disambiguation. The decisions were probably
based on the printed word's orthographic
familiarity (cf. Balota & Chumbley, 1984;
Chumbley & Balota, 1984). Thus, it is likely that
lexical decisions in Hebrew involve neither a
pre-lexical phonologic code nor a post-lexical one.
They are based upon the abstract linguistic
representation that is common to several
phonemic and semantic alternatives.

These results are in contrast to studies on 
lexical ambiguity conducted in English. Lexical 
disambiguation in English can be examined by
employing homographs. Such studies have
suggested that, at least initially, all meanings,
high- as well as low-frequency, are automatically
accessed in parallel (Onifer & Swinney, 1981;
Tanenhaus, Leiman, & Seidenberg, 1979; and see
Simpson, 1984, for a review). It should be noted,
however, that in most cases the ambiguity in
English resides only in the semantic and syntactic
levels. With a few exceptions (e.g., "bow", "wind"),
English homographs have only one phonologic
representation, and the reader usually does not
have to access two different words related to one
printed form.

Although lexical decision in Hebrew might be
based on an abstract orthographic representation,
there is no doubt that the process of word
identification continues until one of several
phonological and semantic alternatives is finally
accessed. This process of lexical disambiguation is
more clearly revealed by using the naming task.
Bentin and Frost (1987) investigated the process
of selecting specific lexical candidates by
examining the naming latencies of unvoweled and
voweled words. In contrast to the result obtained
for lexical decisions, naming of ambiguous strings
was found to be just as fast as naming the most
frequent voweled alternative, with the voweled
low-frequency alternative slowest. In the absence
of constraining context, the selection of one lexical
candidate for naming seems to be affected by a
frequency factor: the high-frequency alternative is
selected first.

In a recent study (Frost & Bentin, in 
preparation), the processing of ambiguous 
consonantal strings in voweled and unvoweled 
Hebrew print was investigated by using a 
semantic priming paradigm. Subjects were 
presented with consonantal strings that could be 
read as a high- or a low-frequency word. These 
strings served as primes to targets that were 
related to one of the two alternative meanings. In 
order to minimize conscious attentional processes, 
targets followed the primes at a short SOA 
(stimulus onset asynchrony) of 100 ms (see Neely, 
1976; 1977, for a discussion of this point). It was 
assumed that if a specific meaning of the 
ambiguous consonantal string was accessed, it 
would be reflected by a semantic facilitation for its 
respective target. Thus, lexical decisions for 
targets that are related to that specific meaning
would be facilitated. 

In contrast to studies on priming at short SOAs
in English (e.g., Seidenberg, Tanenhaus, Leiman,
& Bienkowski, 1982), no semantic facilitation
for the low-frequency meanings was found in the
unvoweled condition at 100 ms SOA. In the
voweled condition there was a significant
semantic facilitation for both the high- and the 
low-frequency meanings. This result suggests that 
in the voweled condition both the high-frequency 
and the low-frequency meanings of the 
consonantal strings were clearly depicted by the 
disambiguating vowel marks. 

Apparently, since the Hebrew reader almost 
never reads voweled print, he uses the 
consonantal information for accessing the lexicon. 
The phonologic representation of the
high-frequency alternative is selected first. Only at a second stage
does the reader consider the low-frequency 
alternative. 

In conclusion, the deep unvoweled Hebrew 
orthography represents primarily the morphology 
of the Hebrew language, while phonemic 
information is conveyed only partially by print.
Consequently, in addition to a phonologic lexicon 
the Hebrew reader has probably developed a 
lexical system which is based on phonologically 
and semantically abstract consonantal strings 
that are common to several words. Lexical 
processing occurs, at a first phase, at this
morphological level. The reader accesses the
abstract string and recognizes it as a valid
morphologic structure. Lexical decisions are
usually given at this early stage and do not
necessarily involve deeper phonological
processing. The complete phonological structure of
the printed word can only be retrieved
post-lexically, after one word candidate has been
accessed. The selection of a word candidate is 
usually constrained by context, but in its absence 
it is based on frequency factors. 

Evidence from cross-language studies

Conducting experiments in different languages 
contributes important insights concerning the role
of pre- or post-lexical phonology in deep and 
shallow orthographies. Nevertheless, conclusive 
inferences cannot be drawn from these studies 
unless they are supported by results obtained in 
cross-language designs. Cross-language designs 
allow a direct comparison of native speakers' 
performances when the independent variables 
under investigation are controlled between 
languages, under identical experimental 
conditions. Hence, they can provide direct 
evidence concerning the effects of the 
orthography's characteristics on the process of
word recognition. Obviously, cross-language
designs are not without potential pitfalls;
language differences may be confounded with
nonlinguistic factors. For example, differences in
the subjects' samples due to motivation,
education, etc., might interact with the
experimental manipulation. The interpretation of
the results, thus, hinges on whether they are
likely to be free of such confounding. 

Katz and Feldman (1983) compared semantic
priming effects in naming and lexical decision in
English and Serbo-Croatian. In this study,
semantic facilitation was assumed to reflect
lexical involvement in both tasks. The results
demonstrated semantic facilitation for both lexical
decision and naming in English. In contrast,
semantic priming facilitated only lexical decision
in Serbo-Croatian. The authors suggested that
phonology, which is necessary for naming, is
derived post-lexically in English: hence the
semantic facilitation in this task. In contrast, the
extraction of phonology from print in
Serbo-Croatian does not call for lexical
involvement; phonology is derived pre-lexically. An
additional finding in the study was a high
correlation of reaction times for lexical decision
and naming in Serbo-Croatian without semantic
context. This result was interpreted as evidence
for an articulatory code used in this language for
both lexical decisions and naming.

The interpretation of differences in reading
performance between two languages as reflecting
subjects' use of pre- vs. post-lexical phonology can
be criticized on methodological grounds. The
correspondence between orthography and
phonology is only one dimension on which two
languages differ. English and Serbo-Croatian, for
example, differ in their grammatical structures
and in the size and organization of their lexicons
(Lukatela, Gligorijević, Kostić, & Turvey, 1980).
These confounding factors, it can be argued, may
have affected subjects' performance in a similar way.

Frost, Katz, and Bentin (1987) endeavored to
address this possible criticism by comparing three
languages simultaneously. They examined lexical
decision and naming performance in Hebrew,
English, and Serbo-Croatian. Although any
comparison between two of the languages might
be confounded by other factors, the set of
confounds is different for each of the three
possible pairs of comparisons. The only factor that
displays consistency with the dependent measure
is orthographic depth. Assuming that it is indeed
the main factor that influences subjects'
performance, predictions concerning a
two-language comparison should extend to the
third language. Note, however, that while the
probability of obtaining a predicted ordering of
performance by chance in two languages is
one out of two, the probability is one out of six
when three languages are compared. Thus, an
appropriate ordering of subjects' performance in
three languages would corroborate more strongly
the psychological reality of orthographic depth.
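
The arithmetic behind this argument can be made explicit. The sketch below assumes that, under the null hypothesis, every strict ordering of the languages is equally likely and that ties do not occur; the function name `chance_of_predicted_order` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def chance_of_predicted_order(languages):
    """Probability that one specific ordering arises by chance alone."""
    orderings = list(permutations(languages))  # n! equally likely orderings
    return Fraction(1, len(orderings))


print(chance_of_predicted_order(["English", "Serbo-Croatian"]))            # 1/2
print(chance_of_predicted_order(["Hebrew", "English", "Serbo-Croatian"]))  # 1/6
```

Adding a third language therefore cuts the chance of a spurious confirmation from one in two to one in six, which is the sense in which the three-language design is the stronger test.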

In their first experiment Frost et al. (1987)
compared reaction times for both lexical decision
and naming of high-frequency words, low-frequency
words, and nonwords in English, Serbo-Croatian, and
Hebrew. The results showed that the lexical
status of the stimulus (being a high- or a
low-frequency word, or a nonword) affected naming
latencies in Hebrew more than in English, and in
English more than in Serbo-Croatian. Moreover,
only in Hebrew were the effects on naming very
similar to the effects on lexical decision: Just as
the lexical status of the stimulus affected lexical
decisions, it also affected naming latencies. This
outcome confirmed that in deep orthographies like
Hebrew, phonology is derived post-lexically. In
contrast, in a shallow orthography like
Serbo-Croatian, naming performance is much less
affected by lexical status. Given the direct
correspondence of orthography to phonology, the
extraction of phonology from print does not call for
lexical involvement.

In a second experiment, Frost et al. compared 
semantic priming effects in naming. Semantic 
priming usually facilitates lexical access. Hence, if 
the word's phonology is derived post-lexically in 
deep orthographies but pre-lexically in shallow 
orthographies, then naming should be facilitated 
more in Hebrew than in English, and again, more 
in English than in Serbo-Croatian. As 
hypothesized, the results revealed a relatively 
strong effect of semantic facilitation in Hebrew 
(21 ms), a smaller but significant effect in English 
(16 ms), and no facilitation in Serbo-Croatian 
whatsoever. These results were taken to strongly 
support the validity of the orthographic depth
factor in word recognition. 

In a recent study, Frost and Katz (1989)
investigated how the different relations between
spelling and phonology in English and
Serbo-Croatian are reflected in the ability of
subjects to match printed and spoken stimuli. They
presented subjects simultaneously with words or
nonwords in the visual and the auditory modality, and the
subject's task was to judge whether the stimuli 
were the same or different. In order to carry out 
the matching process, the subjects had to mentally 
recode the print to phonology, and compare it to 
the phonologic information provided by the 
speech. Performance was measured in three 
experimental conditions: (1) Clear print and clear 
speech, (2) clear print and degraded speech, and 
(3) clear speech and degraded print. Within each 
language, the effects of visual and auditory 
degradation were measured relative to the 
baseline undegraded presentation. 

When the visual or the auditory inputs are 
degraded, subjects are encouraged to restore the 
partial information in one modality by matching it
to the clear information in the other modality. 
When subjects are presented with speech alone, 
restoration of degraded speech components has 
been shown to be an automatic lexical process (see 
Samuel, 1987). However, in addition to this 
ipsimodal restoration mechanism, subjects in the 
Frost and Katz experiment had the additional 
possibility of a compensatory exchange of speech 
and print information. Thus, the technique of 
visual and auditory simultaneous presentation 
and degradation provided insight concerning the 
interaction of orthography and phonology in the 
different languages. 

The results showed that for Serbo-Croatian,
visual degradation had a stable effect relative to
the baseline condition (about 20 ms), regardless of
stimulus frequency. For the English subjects, the
effect of visual degradation was three to four
times stronger than for the Serbo-Croatians. The
inter-language differences that were found for
visual degradation were almost identically
replicated for auditory degradation: The
degradation effects in English were again three to
four times greater than in Serbo-Croatian. Thus,
the overall pattern of results demonstrated that
although the readers of English were efficient in
matching print to speech under normal conditions,
their efficiency deteriorated substantially under
degraded conditions relative to readers of
Serbo-Croatian.

These results were explained by an extension of 
an interactive model (see McClelland & 
Rumelhart, 1981; Rumelhart & McClelland, 
1982), that rationalizes the relationship between
the orthographic and phonologic systems in terms
of lateral connections between the systems at all 
of their levels. The structure of these lateral 
connections is determined by the relationship 
between spelling and phonology in the language: 
simple isomorphic connections between 
graphemes and phonemes in Serbo-Croatian, but 
more complex, many-to-one, connections in 
English. The concept of orthographic depth has 
direct bearing on the question of the relation 
between the phonologic and orthographic systems. 
Within such interactive models, the way in which 
connections are made between the two systems 
should be constrained by the depth of the 
orthography that is being modeled. In a shallow
orthography, a graphemic node can be connected 
to only one phonemic node, and vice versa. Also, 
because words are spelled uniquely, each word 
node in the orthographic system must be 
connected to only one word node in the phonologic 
system. In contrast, in a deeper orthography, a 
graphemic node may be connected to several 
phonemic alternatives, a phonemic cluster may be 
connected to several orthographic clusters, and 
finally, a word in the phonologic system may be 
connected to more than one word in the 
orthographic system, as in the case of homophony 
(e.g., SAIL/ SALE) or, vice versa, as in the case of 
homography (e.g., WIND, READ, BOW, etc.). A 
representation of the different intersystem 
connections is demonstrated in Figure 3 for a 
word that exists in both the English and the 
Serbo-Croatian languages. The Serbo-Croatian 
word, KLOZET, is composed of unique letter- 
sound correspondences while the corresponding 
English word, CLOSET, is composed of 
graphemes, most of which have more than one 
possible phonologic representation, and phonemes,
most of which have more than one orthographic
representation. 

[Figure 3: two panels, SERBO-CROATIAN and ENGLISH, each with an ORTHOGRAPHIC NETWORK and a PHONOLOGIC NETWORK.]

Figure 3.

As shown in Figure 3, the simple isomorphic 
connections between the orthographic and the 
phonologic systems in a shallow orthography 
should enable subjects to restore both the 
degraded phonemes from the print and the 
degraded graphemes from the phonemic 
information, with ease. This should be true 
because, in a shallow system, partial phonemic 
information can correspond to only one, or at 
worst, a few, graphemic alternatives, and vice 
versa. In contrast, in a deep orthography, because 
the degraded information in one system is usually 
consistent with several alternatives in the other 
system, the buildup of sufficient information for a 
unique solution to the matching judgment is 
delayed, and the matching between print and 
degraded speech, or between speech and degraded 
print, is slowed. Therefore, the effects of visual or
auditory degradation were greater for English than
for Serbo-Croatian.
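
A toy calculation illustrates why degraded input is resolved more slowly in a deep orthography. It is a sketch under stated assumptions, not the model of Frost and Katz: the grapheme-to-phoneme tables are hypothetical and heavily simplified, and `candidate_count` is an illustrative name. If each grapheme is treated independently, the number of phonemic strings consistent with a printed word is the product of the number of alternatives for each grapheme.

```python
from math import prod

# Hypothetical, simplified grapheme-to-phoneme tables, for illustration only.
SERBO_CROATIAN = {"K": {"k"}, "L": {"l"}, "O": {"o"},
                  "Z": {"z"}, "E": {"e"}, "T": {"t"}}
ENGLISH = {"C": {"k", "s"}, "L": {"l"}, "O": {"ɒ", "oʊ", "ʌ"},
           "S": {"s", "z"}, "E": {"ɛ", "i", "ə"}, "T": {"t"}}


def candidate_count(word, table):
    """Number of phonemic strings consistent with the printed word."""
    return prod(len(table[g]) for g in word)


print(candidate_count("KLOZET", SERBO_CROATIAN))  # 1
print(candidate_count("CLOSET", ENGLISH))         # 36
```

With one alternative per grapheme the shallow system yields a single candidate, whereas the toy English table yields 36. The exact figure depends entirely on the assumed table; the point is only that the candidate set grows multiplicatively with the number of ambiguous graphemes.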



The importance of orthographic depth:
critique and conclusions

The psychological reality of orthographic depth 
is not unanimously accepted. Although it is 
generally agreed that the relation between
spelling and phonology in different orthographies
might affect reading processes to a certain extent,
there is disagreement as to the relative
importance of this factor. Seidenberg and his
associates (Seidenberg et al., 1984; Seidenberg,
1985; Seidenberg & Vidanović, 1985) have argued
that the primary factor determining whether or
not phonology is generated prelexically is not
orthographic depth, but word frequency. Their
claim is that in any orthography, frequent words
are very familiar as visual patterns. Therefore,
these words can be easily recognized through a 
fast visually-based lexical access which occurs 
before a phonologic code has time to be generated 
pre-lexically from the print. For these words, 
phonologic information is eventually obtained, but 
only postlexically, from memory storage. 
According to this view, the relation of spelling to 
phonology should not affect recognition of frequent 
words. Since the orthographic structure is not 
converted into a phonologic structure by use of 
graphemes-to-phonemes conversion rules, the 
depth of the orthography does not play a role in 
the processing of these words. Orthographic depth 
exerts some influence, but only on the processing 
of low-frequency words and nonwords. Since such 
verbal stimuli are less familiar, their visual lexical 
access is slower, and their phonology has enough 
time to be generated prelexically. 

In support of this hypothesis, Seidenberg (1985)
demonstrated that there were few differences
between Chinese and English subjects in naming
frequent printed words. This outcome was
interpreted to mean that in both logographic and
alphabetic orthographies, the phonology of
frequent words was derived postlexically, after the
word had been recognized on a visual basis.
Moreover, in another study, Seidenberg and
Vidanović (1985) found similar semantic priming
effects in naming frequent words in English and
Serbo-Croatian, suggesting again that the
phonology of frequent words is derived
postlexically, whatever the depth of the
orthography. These results are consistent with a
recent study by Carello, Lukatela, and Turvey
(1988) that demonstrated associative priming
effects for naming in Serbo-Croatian. Although
Carello et al. did not manipulate word frequency
in their study, their results question the
inevitability of pre-lexical phonology in a shallow
orthography; some lexical influence on word
recognition may be possible.

The resolution of these conflicting results is 
certainly not a simple task. A possible approach 
for examining the source of these differences could
consist of examining the experimental 
characteristics of these studies. One salient 
feature of most of the experiments discussed 
above is that they were conducted exclusively in 
the visual modality; that is, print alone was used 
to study the relationship between orthography 
and phonology. The experimental manipulation of 
phonology, therefore, has been indirect, having 
been derived from manipulating the orthography.
One can criticize this methodology for studying 
the processing consequences of the relation 
between phonology and orthography: Because 
phonologic variation is typically obtained through 
orthographic variation, one can never be certain 
which of the two is controlling the subject's 
responses. A simple example can be given in the 
case of homophones. The common assumption that 
two homophones (e.g., bear/bare; sale/sail), share 
a phonologic but not an orthographic structure
(see for example, Rubenstein et al., 1971) is, in a 
way, misleading. Homophones always share 
printed consonants or vowels, and the task of 
disentangling the effect of the shared phonology 
from the shared orthography is complicated. 
Moreover, doubts have been raised about the 
adequacy of the lexical decision and naming tasks 
for measuring lexical as contrasted with prelexical 
involvement (see Balota & Chumbley, 1984, 1985). 

The technique of simultaneous visual and
auditory presentation with degradation proposed
by Frost and Katz (1989; see also Frost, Repp, &
Katz, 1988) furnishes partial solutions to these
methodological problems. First, phonology is
presented to the subjects through a spoken word
and does not have to be inferred from print. More
importantly, by degrading the print or the speech,
the technique affords a way to independently
manipulate the perception of orthography and
phonology. By using this method, Frost and Katz
(1989) have demonstrated that orthographic depth
and not word frequency is the primary factor that
affects the generation of pre- or post-lexical
phonology.

However the assessment of the role of 
orthographic depth in reading cannot be resolved 
solely with methodological arguments. One 
important conclusion from two decades of studies 
in reading is that the reader uses various 
strategies in processing printed words, (see 
McCusker et al., 1981). These strategies have 



been shown to depend on factors like orthographic 
regularity (Parkin, 1982), word frequency 
(Scarborough, Cortese, & Scarborough, 1977), 
ratio of words and nonwords (Frost et al., 1987), or 
special demand characteristics of the 
experimental task (e.g., Spoehr, 1978). By the 
same argument, one cannot fully account for the 
reader's processing without taking into 
consideration the reader's linguistic environment. 
Although the skilled reader in every orthography 
becomes familiar with his own language's 
orthographic structures, I suggest that the depth 
of the orthography is an important factor. 

One common misinterpretation of claims 
concerning the importance of orthographic depth 
is to view a language's orthographic system as 
constraining the reader to only one form of 
processing. For example, although Frost et al. 
(1987) have shown no semantic facilitation for 
naming a specific set of stimuli in Serbo-Croatian, 
it does not follow that Serbo-Croatian readers 
never generate phonology post-lexically. One 
should always give the reader credit for extensive 
flexibility. If the words in the experiments were 
closely associated, even the Serbo-Croatian reader 
might find the extraction of phonology post- 
lexically more efficient than a pre-lexical 
extraction. But under similar conditions, relative 
differences should be found between deep and 
shallow orthographies. 

In conclusion, the argument concerning the 
effect of orthographic depth is an argument 
concerning the priority of using a specific 
processing strategy for generating phonology in 
different orthographies. Research conducted in 
English, Serbo-Croatian, and Hebrew suggests that 
orthographic depth has indeed a strong 
psychological reality. 

Phonology and Reading: Evidence from Profoundly 
Deaf Readers* 

Vicki L. Hanson† 



The prelingually, profoundly hearing-impaired reader of English is at an immediate 
disadvantage in that he or she must read an orthography that was designed to represent 
the phonological structure of English. Can the deaf reader become aware of this structure 
in the absence of significant auditory input? Evidence from studies with deaf college 
students will be considered. These studies indicate that successful deaf readers do 
appreciate the phonological structure of words and that they exploit this knowledge in 
reading. The finding of phonological processing by these deaf readers makes a strong case 
for the importance of phonological sensitivity in the acquisition of skilled reading, whether 
in hearing readers or deaf readers. 



In the normal course of events, children are 
fluent speakers and listeners of their native 
tongue, English, before they begin learning to 
read. In this chapter, I am concerned with a 
population of readers for whom this is not the 
case. This population is prelingually, profoundly 
hearing-impaired. For these deaf readers, speech 
and lipreading are difficult to acquire and require 
years of instruction. 

Research on reading has indicated that building 
on a spoken language foundation is a critical 
feature of reading and that in order to use an 
alphabetic orthography, such as English, to best 
advantage, the reader must go beyond the visual 
shape of words to apprehend their internal 
phonological structures (Liberman, 1971, 1973). 
Despite their extensive experience in using the 
phonology in everyday speech, evidence presented 
elsewhere in this monograph argues that hearing 
children who are poor readers may have 
phonological deficits that underlie their reading 
problem. These children have difficulty in setting 
up phonological structures, in apprehending such 
structures in words, and in using a phonetic code 
for the storage and processing of words in working 
memory. The phonological deficits of these 
children may be fairly subtle, however, such that 
no difficulty in the child's speaking ability or 
listening comprehension may be readily apparent. 

If in the hearing population even subtle phonological 
deficits are associated with poor reading, 
then how is it possible for profoundly deaf individuals 
to read? One might suppose that deaf 
persons would have difficulty with reading, and, 
indeed, this is the case. Surveys have consistently 
shown that hearing-impaired students lag significantly 
behind their normally-hearing counterparts 
in reading achievement (Conrad, 1979; Furth, 
1966; Trybus & Karchmer, 1977). Although it is 
typical to state, based on these surveys, that the 
average hearing-impaired student graduating 
from high school reads only at about the level of a 
hearing child of fifth grade, that statistic obscures 
the even greater reading deficiency of profoundly 
hearing-impaired students, that is, those who 
could be considered truly deaf. For them, the 
statistics are even more discouraging: Profoundly 
deaf students graduating from high school read, 
on the average, only at the level of a normally 
hearing child of third grade (Conrad, 1979; 
Karchmer, Milone, & Wolk, 1979). Remember, 
though, that these reading achievement scores 
represent only a population average. We find, for 
example, that measures of reading achievement 
for deaf students attending Gallaudet University 
may average seventh to tenth grade, with some 
students reading at above twelfth grade level (see, 
for example, Hanson, 1988; Hanson & Feldman, 
1989; Hanson, Shankweiler, & Fischer, 1983; 
Reynolds, 1975). 

These statistics on reading achievement levels 
of deaf students have been used by investigators 
to argue two opposing views of the relationship 
between phonology and reading. The assumption 
common to both views is that the hearing 
impairment of these students prevents access to 
English phonology. In one view, access to 
phonological information is believed to be crucial 
for reading and the generally low reading 
achievement levels of deaf students are believed to 
reflect its importance. Because these readers 
presumably lack access to English phonology, 
their acquisition of reading suffers as a 
consequence. The second view takes the position 
that access to phonological information is not 
important in reading. The fact that some deaf 
individuals are able to attain fairly high reading 
levels is taken as evidence of this. Again, the 
assumption is that these readers, due to their 
hearing impairment, lack access to phonological 
information. Consequently, if they succeed at 
reading it must be without benefit of phonology. 

Neither of these positions need be correct, 
however, in their interpretation of the reading 
achievement of deaf students. Deaf readers, 
despite their hearing impairment, might have 
access to phonology that could be used to support 
skilled reading. To assume that deaf readers lack 
access to phonology because of their deafness 
confuses a sensory deficit with a cognitive one. 
While the term phonological is often used to mean 
acoustic/auditory, or sound, this usage reflects a 
common misunderstanding of the term. 
Phonological units of a language are not sounds, 
but rather a set of meaningless primitives out of 
which meaningful units are formed. These 
primitives are related to gestures articulated by 
the vocal tract of the speaker (see Liberman & 
Mattingly, 1985, for a more detailed discussion).¹ 
In the case of English, the deaf individual could 
learn about the phonology of the language from 
the motor events involved in speech production, 
through experience in lipreading, or from 
experience with the orthography. 

As a rule, deaf children in English speaking 
countries receive intensive instruction in speaking 
and lipreading. This is true both in schools that 
use an oral educational approach (with speech 
being the only means of communication used in 
the classroom) and in schools that use a 
simultaneous or total communication approach 
(with speech being accompanied by manual 
communication in the classroom). Through this 
speech training, prelingually, profoundly deaf 
individuals develop varying skill in speaking and 
lipreading. Although some of these individuals 
develop quite good speaking and lipreading skills, 
most do not (Conrad, 1979; Smith, 1975). Speech 
training, nevertheless, does provide the deaf 
individual with a means of learning English 
phonology. 

Speech intelligibility does not necessarily 
indicate, however, the extent to which a deaf 
reader has access to phonological information. 
Intelligibility reflects the degree to which a deaf 
speaker's speech can be understood by a listener. 
Among the things that can affect intelligibility are 
phonation and prosodic information. While such 
features clearly add to intelligibility, they may not 
be relevant for an individual's internal 
manipulation of phonological information. In any 
event, it cannot simply be assumed that deafness 
necessarily blocks access to English phonology. 
This is a question for empirical investigation. 

How do congenitally, profoundly deaf readers 
who read well manage to do it? That is the 
question to be addressed here. It is possible that 
deaf readers read English as if it were a 
logographic language; namely, treating printed 
English words as visual characters, without 
taking into account the correspondences between 
the printed letters and the phonological structure 
of words. Research on the reading of Japanese and 
Chinese, however, has suggested that for 
logographic languages, as for alphabetic 
languages, phonetic recoding of words is one 
component of a linguistic processing system 
required for the task of reading (Erickson, 
Mattingly, & Turvey, 1977; Mann, 1985; Tzeng, 
Hung, & Wang, 1977). For example, Tzeng et al. 
(1977) found that the phonetic composition of 
printed Chinese characters influenced sentence 
processing for skilled readers of Chinese. These 
investigators concluded that even in cases where 
lexical access is possible without phonological 
mediation, a phonetic code is still required for 
effective processing in working memory. 

The deaf individuals who participated in the 
studies to be discussed here had backgrounds in 
which sign was used predominantly. That is, they 
generally had or were receiving instruction using 
sign language. Most of these individuals 
considered American Sign Language (ASL) to be 
their preferred means of communication. ASL is 
the common form of communication used by 
members of deaf communities across the United 
States and parts of Canada. It is a visual-gestural 
language that has developed independently from 
spoken languages and from other signed 
languages. For many of the subjects in the studies 
reported here, ASL was their first language, 
having been learned as a native language from 
deaf parents. These subjects were typically 
undergraduates at Gallaudet University. All were 
profoundly deaf. These deaf subjects, therefore, 
can be characterized as having higher than 
average reading levels and not being exposed to 
an exclusively oral background. 

Findings on phonetic coding in working 
memory 

Evidence reviewed elsewhere in this monograph 
indicates that hearing children who are poor 
readers have a language deficit that is specific to 
the phonological domain. For example, in tests of 
short-term memory, hearing poor readers recall 
fewer items overall and display less sensitivity to 
rhyme than hearing good readers (see, for 
example, Shankweiler, Liberman, Mark, Fowler, 
& Fischer, 1979). That is, on rhyming lists, the 
accuracy of the good readers is typically worse 
than on nonrhyming lists. In contrast, the 
accuracy of the poor readers is about the same for 
rhyming and nonrhyming lists. The good readers' 
differential performance on recall of rhyming and 
nonrhyming strings has been taken to mean that 
these readers convert the printed letters into a 
phonetic form and retain this phonetic 
information in memory. Accordingly, the finding 
that poor readers are not much affected by 
rhyming manipulations suggests that they are 
less able to use the phonetic information. 

Is a phonetic code uniquely well-suited to the 
task of reading? To examine this question, we 
asked whether for deaf signers a different 
language code, one based on the structure of signs, 
could provide an alternative coding system for 
reading. 

A sign of ASL is produced by a combination of 
the formational parameters of handshape, place of 
articulation, movement, and orientation (Battison, 
1978; Stokoe, Casterline, & Croneberg, 1965). 
Evidence indicates that when signs are presented 
for recall in a short-term memory task, the signs 
are coded in terms of these formational 
parameters. The first line of evidence comes from 
studies of intrusion errors in the recall of lists of 
signs (Bellugi, Klima, & Siple, 1975; Krakow & 
Hanson, 1985). In the study by Bellugi, Klima, 
and Siple, lists of spoken words and signs were 
presented to hearing adults and deaf adults, 
respectively. The subjects were asked for 
immediate written recall. Intrusion errors for the 
hearing group were confusions of phonetically 
similar words. For example, a subject might write 
the word boat instead of the word presented, 
"vote." Errors for the deaf subjects, however, were 
completely different; they were confusions of the 
formational parameters of signs. As an example, a 
subject might write the word egg instead of the 
sign presented, NAME. The signs corresponding 
to the words name and egg differ only in terms of 
the movement of the hands (see Figure 1). 








Figure 1. Formationally similar signs. Shown left to right 
from the top are KNIFE, EGG, NAME, PLUG, TRAIN, 
CHAIR, TENT, SALT. (From Hanson, 1982, p. 574). 

In addition, there is evidence that lists of 
formationally similar signs can produce 
performance decrements in serial recall tasks 
(Hanson, 1982; Poizner, Bellugi, & Tweney, 1981). 
For example, shown in Figure 1 is a set of 
formationally similar signs that I used in one such 
study (Hanson, 1982). Shown here are, left to 
right from the top, the signs for the words KNIFE, 
EGG, NAME, PLUG, TRAIN, CHAIR, TENT, and 
SALT. On each trial in that experiment, deaf 
college students were shown five of the signs from 
this set, and were asked to remember the five 
signs in order. Results of that experiment 
indicated that fewer signs were recalled from lists 
made up of signs from this formationally similar 
set than from lists made up of signs from a 
formationally unrelated (control) set. 

Despite such evidence that sign coding can thus 
mediate short-term recall of signs, evidence from 
other research does not support the notion that a 
sign code can serve as a viable code in the service 
of skilled adult reading. In another condition of 
that study (Hanson, 1982), I tested deaf college 
students in a short-term recall task of printed 
words. There were three types of word lists of 
interest here: rhyming words, orthographically 
similar words, and words whose signs were 
formationally similar. The words in the rhyming 
set were two, blue, who, chew, shoe, through, jew, 
and you. The words in the orthographically 
similar set were visually similar. The words in 
this set were bear, meat, head, year, learn, peace, 
break, and dream. While argument could be taken 
with the degree of visual similarity of the words in 
this list, it is at least true that these words are 
more similar visually than were the words in the 
rhyming list. The words in this visually similar 
set served as a control to ensure that any 
potential rhyme effects could not be attributed to 
the visual similarity of the printed words. The 
words in the formationally similar set were words 
whose corresponding signs were formationally 
similar. These were the words knife, name, plug, 
train, chair, tent, and salt, whose corresponding 
signs are shown in Figure 1. Each of these sets 
was paired with a control set of words. Of interest 
in this experiment were any differences in ability 
to recall an experimental and control set. 

The pattern of results in that experiment clearly 
indicates the use of phonetic coding by the deaf 
subjects. Whereas these subjects recalled 65.4 
percent of the lists in the control condition, they 
recalled only 47.6 percent of the lists in the 
phonetically similar (rhyming) condition. There 
was, however, no decrement on the visually 
similar lists, indicating that the decrement on the 
rhyming lists was due to phonetic, not visual, 
similarity. 

Interestingly, I found no evidence that the deaf 
college students I tested were using sign coding. 
Their performance on lists of words having 
formationally similar signs and on the control lists 
was comparable (52.9 percent of the control lists 
recalled vs. 51.4 percent of the formationally 
similar lists recalled). Converging evidence from 
later research supports this finding that the better 
deaf readers do not use sign coding in their recall 
or reading of printed English words (Lichtenstein, 
1985; Treiman & Hirsh-Pasek, 1983). More than 
100 years ago, Burnet (1854) argued that sign 
coding would be a ponderous strategy for the deaf 
readers, and, thus, limited in its use to the poorer 
readers. By finding that the better deaf readers do 
not use sign coding when processing printed 
English words, current research on the cognitive 
processing of deaf readers is consistent with 
Burnet's speculations. 
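
The recall figures reported above can be set side by side as a simple similarity decrement, that is, the control percentage minus the experimental percentage of lists recalled. The following is a minimal sketch in Python; the function name `similarity_decrement` is hypothetical, and the arithmetic illustrates only the direction and size of the two effects, not the statistical tests reported in Hanson (1982).

```python
def similarity_decrement(control_pct, experimental_pct):
    """Percentage-point drop in lists recalled relative to the control set."""
    return round(control_pct - experimental_pct, 1)

# Percent of lists recalled, as reported in the text (Hanson, 1982).
rhyme_decrement = similarity_decrement(65.4, 47.6)  # rhyming vs. its control
sign_decrement = similarity_decrement(52.9, 51.4)   # sign-similar vs. its control

print(rhyme_decrement)  # 17.8
print(sign_decrement)   # 1.5
```

On these figures the rhyming lists cost about eighteen percentage points of recall, whereas the formationally similar lists cost under two.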

The finding that the better readers were using 
phonetic coding is reminiscent of the results 
reported by R. Conrad (1979) in a very large scale 
study of deaf and hearing-impaired students in 
England and Wales. Conrad tested these students 
in a short-term memory task of rhyming and 
nonrhyming lists of printed letters. Comparing 
their performance on this memory task with 
measured reading ability, Conrad found that the 
better readers in his deaf population recalled 
fewer rhyming than nonrhyming lists. Thus, the 
better readers were using phonetic coding. 

My study with deaf signers (Hanson, 1982) took 
Conrad's findings one step further. Conrad's 
subjects were from schools that generally 
subscribed to an oral philosophy of education. As a 
result, phonetic coding was the only language 
form available to the subjects. In my study, the 
deaf subjects had sign language readily available 
to them. In fact, all of my deaf subjects had deaf 
parents and reported ASL to be their first 
language. Yet, these signers, as skilled deaf 
readers, used phonetic coding in that memory 
task, indicating the importance of phonetic coding 
in short-term retention of printed material. 

Sensitivity to the phonological structure of 
English words 

Additional evidence that deaf readers can access 
phonological information about English words is 
provided by studies of individual word reading. 
For example, one experimental paradigm that has 
been shown to produce phonological effects with 
hearing readers uses a lexical decision task in 
which two letter strings are shown to the subjects 
on every trial, one string above the other (Meyer, 
Schvaneveldt, & Ruddy, 1974). The subjects must 
decide whether or not both of the letter strings on 
a trial are real English words. 

In a series of three experiments, we used this 
paradigm with deaf college students (Hanson & 
Fowler, 1987). There were two types of word pairs 
of particular interest. As shown in Table 1, the 
first was pairs in which the two words rhymed. 
These rhyming words were spelled alike except for 
the first letter. The second type of word pair of 
interest was pairs in which the two words were 
spelled alike except for the first letter, but the 
pairs did not rhyme. It is apparent that the 
rhyming and nonrhyming pairs were equally 
similar orthographically, differing only in the 
phonological similarity of the two members of a 
pair. We tested whether there was any difference 
in the response times to the rhyming and 
nonrhyming pairs. Since, however, response times 
to words vary with word familiarity and 
orthographic regularity, it was not possible in this 
study to simply compare the responses to the 
rhyming and nonrhyming pairs. To eliminate 
familiarity and regularity as confounding factors, 
two control conditions were used. Word pairs in 
the control conditions used the same words as in 
the rhyming and nonrhyming pairs, but were 
repairings of these words. Thus, the control pairs 
for the rhyme condition were the same words as in 
the rhyme condition, just paired now with 
different words. For example, in the rhyme 
condition, the words save-wave and fast-past were 
paired together, while in the rhyme control save-past 
were paired together and fast-wave were 
paired together. Similarly, the control pairs for 
the nonrhyme condition were the same words as 
in the nonrhyme condition, just paired with 
different words. By comparing each word in the 
rhyme and nonrhyme condition with itself in a 
control condition, any effects of word frequency 
and regularity were eliminated. 
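
The re-pairing logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `repair_pairs` is hypothetical, and swapping second members between adjacent pairs is one simple way to re-pair the words, not necessarily the procedure used in the original study. The example words are those given for Experiment 2.

```python
def repair_pairs(pairs):
    """Build control pairs by swapping second members between adjacent pairs.

    Every word that appears in the experimental pairs also appears in the
    control pairs, so each word serves as its own control.
    """
    if len(pairs) % 2 != 0:
        raise ValueError("need an even number of pairs to re-pair")
    controls = []
    for i in range(0, len(pairs), 2):
        (first_a, second_a), (first_b, second_b) = pairs[i], pairs[i + 1]
        controls.append((first_a, second_b))
        controls.append((first_b, second_a))
    return controls

rhyming = [("save", "wave"), ("fast", "past")]
nonrhyming = [("have", "cave"), ("last", "east")]

rhyme_controls = repair_pairs(rhyming)        # save-past, fast-wave
nonrhyme_controls = repair_pairs(nonrhyming)  # have-east, last-cave
```

Because the control pairs reuse exactly the same words, any difference between an experimental pair and its control cannot be attributed to the familiarity or regularity of the individual words.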



Table 1. Rhyming and nonrhyming pairs and their 
matched controls. 

Rhyming Pairs        Nonrhyming Pairs 
save-wave            have-cave 
fast-past            last-east 

Rhyming Controls     Nonrhyming Controls 
save-past            have-east 
fast-wave            last-cave 

Source of data: From Hanson & Fowler, 1987, Experiment 2. 



The predictions for this experiment are shown 
in Table 2. If readers in this task did not access 
phonological information, then there should have 
been no effect due to the phonological 
relationships between words in a pair. That is, if 
the readers were using solely orthographic 
information, then the first equation shown here 
would hold; namely, that the difference between 
response times to rhyming pairs and the rhyme 
controls would equal the difference between 
response times to nonrhyming pairs and their 
controls. Thus, response times would be the same 
whether or not the words of a pair rhymed. 

Table 2. Predictions in the Hanson & Fowler, 1987 
Study. 

If ORTHOGRAPHIC CODING, then: 

Control - Rhyming = Control - Nonrhyming 

If PHONOLOGICAL CODING, then: 

Control - Rhyming ≠ Control - Nonrhyming 



If, however, readers were accessing phonological 
information, then there would be a difference in 
response times as a function of phonological 
relationships between words in a pair. Access to 
phonological information would be indicated if the 
second equation held; namely, that the two 
differences in response times would not be equal. 
In that event, the response times would be 
affected by the rhyming manipulation. 

For the deaf college students we tested, there 
was an effect of the phonological relationship 
between the words in a pair. Shown in Table 3 are 
the response times from one experiment of that 
study (Experiment 2). Response times were faster 
for the rhyming pairs than for the matched 
controls. In contrast, response times were slower 
on the nonrhyming pairs than on the matched 
controls. Since the rhyming and nonrhyming pairs 
were equally similar orthographically, this 
significant difference in response times for the 
rhyming and nonrhyming pairs was not due to 
orthographic influences. As a consequence, the 
difference in response times to the rhyming and 
nonrhyming pairs could be unambiguously 
attributed to the discrepant phonological 
structures. Thus, these good deaf readers, like 
hearing readers, accessed phonological 
information when reading words. 

Most impressive is the finding that the deaf 
subjects in that study were not only accessing 
phonological information, but were doing so in a 
highly speeded task. It might be supposed that 
deaf readers would be able to access phonological 
information only in situations in which they have 
time to laboriously recover learned pronunciations. 
In this research, however, we found that 
they accessed phonological information quite 
rapidly, suggesting that accessing such 
information is a fundamental property of reading 
for these skilled readers. 

Table 3. The response time (RT) difference for the 
rhyming and nonrhyming pairs and their matched 
controls for deaf college students. 

Control - Rhyming (52 ms) ≠ Control - Nonrhyming (-15 ms) 

Source of data: From Hanson & Fowler, 1987, Experiment 2. 
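
The two predictions can be expressed as a small decision rule applied to the reported response time differences. The sketch below is illustrative only: the function name `coding_prediction` and the `tolerance_ms` parameter are hypothetical, and a fixed tolerance merely stands in for the significance test that a real analysis would use.

```python
def coding_prediction(rhyme_diff_ms, nonrhyme_diff_ms, tolerance_ms=0.0):
    """Return which prediction the two RT differences are consistent with.

    Each difference is (control RT - experimental RT) in milliseconds.
    Equal differences match the orthographic-coding prediction; unequal
    differences match the phonological-coding prediction.
    """
    if abs(rhyme_diff_ms - nonrhyme_diff_ms) <= tolerance_ms:
        return "orthographic"
    return "phonological"

# Differences reported for Experiment 2 (Hanson & Fowler, 1987).
result = coding_prediction(52, -15)
print(result)  # phonological
```

With the reported values, rhyming pairs were 52 ms faster than their controls and nonrhyming pairs 15 ms slower, so the two differences are unequal and the outcome falls under the phonological-coding prediction.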

In more recent work, we have found other 
evidence that skilled deaf readers are sensitive to 
the phonological structure of words. For example, 
deaf college students, when asked to think of 
words that rhyme with a specific target word, have 
been found to be able to do so (Hanson & McGarr, 
1989). In addition, we have found that deaf college 
students are able to apply principles of grapheme- 
phoneme correspondence in generating the correct 
pronunciation of letter strings not previously 
encountered, a skill underlying the acquisition of 
new words. In this latter task (Hanson, 1989), I 
tested these students on their reading of 
orthographically possible nonwords; that is, 
pseudowords. The critical test was between 
pseudowords such as flaim that were 
homophonous with an actual English word (flame) 
and control pseudowords. These controls were 
orthographically matched pseudowords that were 
not homophonous with an actual English word 
(e.g., proom). Examples of stimuli from the task 
are shown in Table 4. 



Table 4. Examples of pseudohomophones and control 
pseudowords. 

English Word    Pseudohomophone    Control 
flame           flaim              proom 
dog             daug               grine 
spoon           spune              fofh 
tall            tMl                braie 
home            hoem               ^pii! 
blue            bloo               nole 
noon            nune               fine 

Source: Selection of stimulus items from Hanson, 1989, based 
on Macdonald, 1988. 



In two experiments using different lists of 
pseudowords, a paper and pencil task tested 
whether subjects could identify which of several 
pseudowords were homophonous with English 
words. The actual instructions to subjects were 
that they were to indicate whether or not each of 
the "nonsense words" was pronounced like a real 
English word. In both experiments, deaf college 
students were able to correctly make this 
judgment with better than chance accuracy, 
although they were not as accurate as the hearing 
subjects. As an additional aspect of this 
pseudohomophone task, subjects in one of the two 
experiments were asked to indicate which English 
word they thought a pseudoword was pronounced 
like, if they had indicated that they thought it was 
pronounced like one. In this second task, deaf 
subjects were usually able to supply the correct 
English word. 

Studies on individual word reading thus 
indicate that it is possible for deaf readers to have 
access to English phonology. This does not mean 
that such access is easy for these readers. Nor 
does it mean that all or even most deaf readers 
are able to use this information. The point, rather, 
is that hearing loss alone does not preclude access 
to phonology. In addition, it is important to note 
that the better deaf readers generally take 
advantage of this phonological information. 

Why phonological coding? 

In sum, the evidence that has been 
summarized here indicates that it is possible for 
deaf readers to use phonology. The use of 
phonological information tends to be characteristic 
of deaf good readers, whether they are beginning 
readers (Hanson, Liberman, & Shankweiler, 
1984), high school students (Conrad, 1979; 
McDermott, 1984), or college students (Hanson, 
1982; Hanson & Fowler, 1987; Lichtenstein, 
1985). Why would the better deaf readers use this 
type of linguistic information when reading? One 
possibility has to do with the structural properties 
of particular languages. In English, where word 
order is relatively fixed, grammatical structuring 
is essentially sequential. A phonological code may 
be an efficient medium for retaining the 
sequential information that is represented in 
English. 

Deaf individuals have specific difficulty in the 
recall of temporally sequential information 
(Hanson, in press). Studies have consistently 
found that the measured memory span of deaf 
individuals is shorter than that of hearing persons 
(see, for example, Bellugi et al., 1975; Blair, 1957; 
Belmont & Karchmer, 1978; Conrad, 1979; 
Hanson, 1982; Kyle, 1980; Pintner & Paterson, 
1917; Wallace & Corballis, 1973). It is important 
to note that this finding of a short span applies 
not only to the English materials (e.g., lists of 
words, letters, or digits), but also applies to 
studies that have measured serial recall of signs. 
Fairly typical results were found in the Bellugi et 
al. (1975) study, in which deaf adults' correct 
serial recall of signs reached an asymptote with a 
list length of four signs, while the hearing subjects 
reached asymptote with lists of six words. Thus, 
the differences in memory span found between 
hearing and deaf individuals appear to be due not 
simply to unfamiliarity with the English material; 
rather, they appear to be related to cognitive 
processes involved in short-term memory for 
linguistic materials, in general. 

Ability to maintain a sequence of words in short- 
term memory is related to the use of phonological 
coding. That is, studies with orally trained 
subjects (Conrad, 1979), native signers of ASL 
(Hanson, 1982), and subjects mixed in terms of 
their educational and linguistic backgrounds 
(Lichtenstein, 1985), have all found strong 
correlations between the magnitude of the rhyme 
effect for deaf subjects and measured memory 
span. In these studies, the larger the rhyme effect 
for a deaf subject, the larger that subject's 
memory span. In contrast, no correlation between 
use of manual coding and measured memory span 
has been established (Lichtenstein, 1985). 

Given this relationship between serial recall 
ability and phonological coding, we have 
suggested that one reason the skilled deaf reader 
uses phonological coding may have to do with the 
critical syntactic role played by sequential 
structuring in English (Hanson, 1982; Lake, 1980; 
Lichtenstein, 1985). This analysis suggests that 
an issue to be faced by teachers is how to educate 
deaf students to process a highly temporally 
structured language such as English. 

Deaf readers and phonology 

It is notable that the subjects in the studies 
discussed were not generally from oral 
backgrounds. In some cases, subjects were 
expressly selected because they were native 
signers of ASL. Yet, even these subjects, if skilled 
readers, were found to be using phonological 
information in the reading of English, rather than 
referring to ASL. 

A discussion of phonological sensitivity in deaf 
readers always leads to the question of how this 
sensitivity is acquired. It is likely that 
congenitally, profoundly deaf readers acquire 
phonology from a combination of three sources: 
experience with the orthography through reading, 
experience in speaking, and experience in 
lipreading. In many of the studies discussed here, 
there was evidence of phonological processing for 
deaf subjects whose speech was not intelligible. 
That even these subjects use phonological coding 
suggests that deaf individuals' ability to use 
phonological information when reading is not well 
reflected in the intelligibility ratings of their 
speech. Further research is needed to determine 
the type of language instruction capable of 
promoting access to the speech skills most 
relevant to reading. 

When this chapter was first planned, it was 
titled "Is reading different for deaf individuals?" 
The answer appears to be both yes and no. Clearly 
the answer is yes in the sense that deaf readers 
will bring to the task of reading very different sets 
of language experiences than the hearing child. 
These differences will require special instruction. 
But, the answer is also no. The evidence indicates 
that skilled deaf readers use their knowledge of 
the structure of English when reading. Although 
sign coding, in theory, might be used as an 
alternative to phonological coding for deaf signers, 
the research using various short-term memory 
and reading tasks has found little evidence that 
words are processed with reference to sign by the 
better deaf readers. Rather, the better deaf 
readers, like the better hearing readers, have 
learned to abstract phonological information from 
the orthography, despite congenital and profound 
hearing impairment.

The finding of phonological processing by deaf 
readers, particularly deaf readers skilled in ASL, 
makes a strong case for the importance of 
phonological sensitivity in the acquisition of 
skilled reading, whether the reader is hearing or 
deaf. For deaf readers, the acquisition and use of
phonological information is extremely difficult.
They would be expected to use alternatives such
as visual (orthographic) or sign strategies, if such
were effective. Yet, the evidence indicates that the
successful deaf readers do not rely on these
alternatives.

FOOTNOTES

*In D. Shankweiler & I. Y. Liberman (Eds.), Phonology and reading
disability: Solving the reading puzzle (pp. 69-89). Ann Arbor:
University of Michigan Press. (1989).

†Now at IBM Research Division, Thomas J. Watson Research
Center, Yorktown Heights, New York.
¹The term phonology need not be limited to use with spoken
languages. In the case of American Sign Language, for
example, the term phonology has been used to describe the
linguistic primitives related to the visible gestures by the
hands, face, and body of the signer. In the present chapter,
however, phonology will be restricted in reference only to
features related to spoken languages.

Syntactic Competence and Reading Ability in Children*



Shlomo Bentin,† Avital Deutsch,† and Isabelle Y. Liberman††



The effect of syntactic context on auditory word identification and on the ability to detect
and correct syntactic errors in speech was examined in severely reading disabled children
and in good and poor readers selected from the normal distribution of fourth graders. The
poor readers were handicapped when correct reading required analysis of the sentence
context. However, their phonological decoding ability was intact. Identification of words
was less affected by syntactic context in the severely disabled readers than in either the
good or poor readers. Moreover, the disabled readers were inferior to good readers in
judging the syntactical integrity of spoken sentences and in their ability to correct the
syntactically aberrant sentences. Poor readers were similar to good readers in the
identification and judgment tasks, but inferior in the correction task. The results suggest
that the severely disabled readers were inferior to both good and poor readers in syntactic
awareness, and in ability to use syntactic rules, while poor readers were equal to good
readers in syntactic awareness but were relatively impaired in using syntactic knowledge
productively.



Fluent reading involves a complex interaction of
several parallel processes that relate visual
graphemic stimuli to specific entries in the lexicon
and combine the semantic and syntactic
information contained in those entries to
apprehend the meaning of sentences. Some of
these processes relate to the decoding of the
phonological code from print while others relate to
the assignment of meaning to the phonological
units. Although the decoding of the phonological
code can, in principle, be based solely on "bottom-up"
application of grapheme-to-phoneme
transformation rules, it is well documented that
this process is supported by "top-down" streaming
of lexical knowledge and contextual information.
The common denominator of the "bottom-up" and
"top-down" processes in reading is that both are
components of the human linguistic endowment
(see Perfetti, 1985; Rozin & Gleitman, 1977).

This study was supported by the Israel Foundations
Trustees to Shlomo Bentin. Isabelle Liberman was supported
in part by National Institute of Child Health and Human
Development Grant HD-01994 to Haskins Laboratories. The
useful comments of Anne Fowler are much appreciated. We
gratefully acknowledge the cooperation of the principals and
teachers of the integrative schools "Luria" and "Stoane" in
Jerusalem, and of the School for Remediation Teaching in
Hertzelia, Israel.



The relative contribution of context dependent 
(top-down) processes to visual word recognition is
determined by many factors among which reader
competence is particularly important. Although
some authors have argued that as fluency
develops, the reader increasingly relies on
contextual information during word recognition
(Smith, 1971), more recent studies have
disconfirmed this hypothesis. For example,
semantic priming facilitates lexical decisions more
in children than in adults and more in younger
than in older children (Schvaneveldt, Ackerman,
& Semlear, 1977; West & Stanovitch, 1978).
Similarly, context effects are greater in poor
readers than in good readers both in lexical
decision (Schwantes, Boesl, & Ritz, 1980) and
naming tasks (Perfetti, Goldman, & Hogaboam,
1979; Stanovitch, West, & Feeman, 1981). Within
the same subject, larger context effects occur
when bottom-up processes are inhibited by
degrading or masking the target stimuli (e.g.,
Becker & Killion, 1977; Massaro, Jones, Lipscomb,
& Scholz, 1978; Meyer, Schvaneveldt, & Ruddy,
1975). These results imply that the increased role 
of higher-level contextual processes in visual word 
recognition is caused by a need to compensate for
deficiencies in lower-level decoding processes,
such as those that occur with poor readers or
degraded stimuli (Perfetti & Roth, 1981;
Stanovitch et al., 1981).

The observation that context effects are larger 
in poor than in good readers does not imply, 
however, that poor readers are better or more 
efficient users of contextual information than good 
readers. In fact, the opposite may be true. Both 
Perfetti et al. (1979) and Schwantes et al. (1980) 
reported that when subjects are required to use 
the context for predicting the target word before
seeing it, skilled readers do better than poor
readers. The magnitude of semantic priming
effects in visual word recognition may therefore be
a rather poor indicator of the real ability to use
contextual information. A relatively unbiased
method for comparing good and poor readers' 
awareness of contextual information would be to 
evaluate context effects on word recognition in 
situations that either eliminate the need for 
decoding of print, as in auditory presentation, or 
that force all readers, regardless of skill, to use 
contextual processes to the same extent.

As a language in which efficient use of 
contextual information is essential for fluent 
reading, Hebrew provides an excellent channel 
through which to examine syntactic effects. In 
Hebrew orthography, letters represent mostly 
consonants, while vowels are represented by 
diacritical marks placed below, within, or above
the letters. The vowel marks are usually omitted
in writing except in poetry, holy scripture, and
children's literature. Because different words may
be represented by the same consonants but
different vowels, when these vowels are absent up
to seven, eight, or more different words may be
represented by the same string of letters. In
addition to being semantically ambiguous, these
Hebrew homographs are also phonemically
equivocal because the (absent) vowels of the words
that are responsible for the different meanings
vary from word to word. Therefore, fluent reading
of "unvoweled" Hebrew requires heavy reliance on
contextual information. 

Reading instruction in Hebrew starts, as a rule, 
with the "voweled" orthographical system in
which the diacritical marks are presented with 
the consonant letters. The vowels are gradually 
omitted from school texts starting at the 
beginning of the third grade. During the third 
grade, the children begin to learn to read without 
vowels. By grade four, they are expected to be 
fluent readers of unvoweled texts. Informal 
discussions with teachers, however, revealed that 
the transition from reading voweled to reading
unvoweled material is not equally easy for all
children. According to teachers, some children are
good readers as long as the diacritical marks are
present but are slow in acquiring the skill of
reading without the vowels. Because without
vowels the context of the sentence is a primary
source of phonological constraints on reading, we
suspected that the children in this group,
although knowing the grapheme-to-phoneme
transformation rules, do not (or cannot) use
contextual information efficiently. Thus, in spite
of being phonologically skilled, those children may
be poor readers. Hebrew may therefore be a
convenient medium through which to test the
hypothesis that at least some deficient readers are
less able than good readers to use context.

We decided to manipulate syntactic rather than 
semantic context in the present research. The
main reason for this decision is that syntax is
probably a more basic linguistic ability than
semantics (see Chomsky, 1969) and less affected
by reading experience (Lasnik & Crain, 1985). In
addition, syntactic violations are more clearly 
defined than are manipulations of semantic 
association strength. 

The effect of prior syntactic structure on the 
processing of a visually presented target word has
been investigated in a number of studies. Lexical
decisions regarding target words are faster when
they are preceded by syntactically appropriate
primes than when preceded by syntactically
inappropriate primes (Goodman, McClelland, &
Gibbs, 1981; Lukatela, Kostić, Feldman, &
Turvey, 1983). Lexical decision (West &
Stanovitch, 1986; Wright & Garret, 1984) and
naming (West & Stanovitch, 1986) are facilitated
when targets are syntactically congruent with
previously presented sentence fragments relative
to when targets follow a syntactically neutral
context.

Although the syntactic context effect on word 
recognition is reliable, information on the relation
between this effect and reading ability is
comparatively scarce and controversial (for a
review see Vellutino, 1979). Several studies report
that poor readers are inferior to normal readers in
dealing with complex syntactical structures in
speech (Brittain, 1970; Byrne, 1981; Cromer &
Wiener, 1966; Goldman, 1976; Guthrie, 1973;
Newcomer & Magee, 1977; Vogel, 1974). Other
authors, however, have challenged a simple
interpretation of these results (Glass & Perna,
1986). For example, Shankweiler and Crain (1986)
have suggested that the poor readers' apparent
deficiency in processing complex syntactic
information may be an epiphenomenon of
limitations of the working memory processor, 
rooted in a difficulty in generating phonological 
codes. 

The present study sought to examine the 
relation between word recognition and syntactic
awareness, i.e., sensitivity to syntactic structure
and the ability to use syntactic knowledge
explicitly, in children who vary in reading
competence. Experiment 1, using a group of adult
fluent readers of Hebrew, was designed to
establish the validity of the procedure of testing
the effect of syntactic context on the identification
of auditorily presented words masked by white
noise. Experiment 2 examined syntactic context
effects in two groups of children. One group was
composed of children with learning disorders 
drawn from a population of students who were 
selected by the school system for special 
supplementary training in reading; a comparison 
group was formed of good readers drawn from the 
population of fourth-graders of two elementary 
schools. In Experiment 3, the same group of good 
readers was compared with a group of poor
readers from the same elementary schools. The
poor readers were matched with the good readers 
for their ability to apply grapheme-to-phoneme 
transformation rules in reading voweled 
pseudowords, but they were significantly inferior 
in reading sentences when the words were printed 
without the diacritical vowel marks. 

EXPERIMENT 1 

The purpose of the present experiment was to 
establish the effect of syntactic context on the
identification of auditorily presented words that
were masked by white noise. The auditory
modality was used to attenuate the deficient
readers' excessive reliance on contextual
information in reading (which is presumably
caused by the need to compensate for their
difficulty in decoding the print). Stimulus masking
was incorporated in the procedure because
previous studies suggest that degradation
increases the tendency of all subjects to use
contextual information for word recognition
(Becker & Killion, 1977; Stanovitch & West,
1981). We chose to use identification rather than
reaction time measures in order to keep the 
measurement as simple and direct as possible. 

Subjects were presented with a list of three- or
four-word sentences that were pre-recorded on
tape. In each sentence, white noise was
superimposed on one or several (target) words.
The subjects were instructed to identify the
masked words. Half the targets in the list were
congruent with the syntactic structure of the 
sentence in which they appeared whereas the 
other targets were incongruent, that is, caused a 
syntactic violation. We predicted that the 
percentage of correctly identified targets would be 
higher for syntactically congruent words than for 
syntactically incongruent words. 

Method 

Subjects 

The subjects were 28 undergraduate students 
(14 males) who participated in the experiment for 
course credit or for payment. They were all native
speakers of Hebrew with normal hearing.

Test Materials 

The auditory test included 104 three- or four-word
sentences. Each sentence was used in two
forms: a) syntactically correct and b) syntactically 
incorrect. The incorrect sentences were 
constructed by changing the correct sentences in 
one of the following 10 ways. 

Type 1 - In this category there were 12 
sentences in which the gender compatibility 
between the subject and the predicate was altered.
In six of these sentences a masculine subject was
presented with a feminine predicate and in the
other six a feminine subject was presented with a
masculine predicate. The masked target was the
predicate, which was always the last word in the 
sentence. 

Type 2 - In this category there were 12
sentences in which the compatibility of number
between subject and predicate was altered. In six
of these sentences a singular predicate followed a
subject in plural form, and vice versa in the other
six. The masked target was the predicate, which
was the last word in each sentence.

Type 3 - There were eight sentences in this 
category. The compatibility of gender and number 
between the subject and the predicate was altered 
in each sentence. The masked target was the 
predicate which was the last word in each 
sentence. 

Type 4 - In Hebrew, prepositions related to 
personal pronouns (e.g., "on me") become one
word. The violation in this category consisted of 
the decomposition of the composed pronoun into 
two separate words (the pronoun and the 
preposition) which kept the meaning but were 
syntactically incorrect. Eight different 
prepositions were used, four times each, totaling 
32 sentences. The masked target words were the 
composed or decomposed pronouns. Because the
decomposition of the pronoun and preposition
added one word to the sentence, another word was 
excluded from each of the incorrect sentences. 

The syntactical violations of types 5 to 10 were
based on changing the compulsory order of parts of
the sentence. Therefore, the sentences of these
types were masked from the first to the last word.

Type 5 - In the ten sentences of this type, the
order of the attribute and its nucleus was
reversed.

Type 6 - In each of the six three-word sentences 
of this type, the predicate was incorrectly 
introduced between the subject and its attribute. 

Type 7 - In Hebrew, the negation always comes 
before the negated predicate. We altered this fixed 
order in six sentences, using three different 
negation words. 

Type 8 - In six sentences, the interrogative 
word was moved from its fixed place at the 
beginning of the sentence to the second place. 

Type 9 - In six sentences, the fixed order of 
preposition and noun was reversed so that the 
noun appeared before the preposition.

Type 10 - In six sentences, the copula that 
should occur between the subject and the 
predicate was moved to the beginning or the end
of the sentence. 

All 104 correct and 104 incorrect sentences were 
recorded on tape by a female native speaker of
Hebrew. The tapes were sampled at 20 kHz. The
masked intervals were marked and white noise
was digitally added to the marked epochs with a
signal-to-noise ratio of 1:2.75. This ratio was 
determined on the basis of pilot tests so that 
correct target identification level was about 50%. 
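
As a rough illustration of the masking step, the sketch below adds white noise to a marked epoch of a sampled signal at a chosen signal-to-noise ratio. It assumes that the 1:2.75 ratio refers to RMS amplitude (the text does not say whether amplitude or power was meant) and that the noise is Gaussian; the names `mask_epoch`, `snr`, and the toy tone are hypothetical and are not taken from the original procedure.

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mask_epoch(signal, start, end, snr=1 / 2.75, seed=0):
    """Return a copy of `signal` with white noise added to signal[start:end].

    The noise is scaled so that rms(speech epoch) / rms(noise) == snr.
    Assumption: the ratio is an RMS-amplitude ratio, not a power ratio.
    """
    rng = random.Random(seed)
    epoch = signal[start:end]
    noise = [rng.gauss(0.0, 1.0) for _ in epoch]
    scale = rms(epoch) / (snr * rms(noise))
    masked = list(signal)
    for i, n in enumerate(noise):
        masked[start + i] += scale * n
    return masked

# Toy example: one second of a 200 Hz tone sampled at 20 kHz,
# with the final 0.3 s (standing in for the target word) masked.
RATE = 20_000
tone = [math.sin(2 * math.pi * 200 * t / RATE) for t in range(RATE)]
masked = mask_epoch(tone, start=14_000, end=20_000)
```

Only the marked epoch is altered, so the unmasked part of the sentence remains fully audible, which is what allows the preceding context to influence identification of the target.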



The 208 sentences were organized into two lists. 
In each list, 52 sentences were syntactically 
correct and 52 sentences were syntactically 
incorrect. Each sentence appeared in each list only 
once, either in correct or incorrect form. Sentences 
that were correct in list A were incorrect in list B 
and vice versa. Fourteen subjects were tested with 
list A and the other 14 with list B. Thus, each 
subject listened to an equal number of correct and 
incorrect sentences, and across subjects each 
sentence appeared an equal number of times in 
each form. 
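
The counterbalancing scheme described above can be sketched as follows. The per-type sentence counts come from the text; the rule of alternating correct and incorrect forms within each violation type is an assumption made for illustration, since the text does not state how forms were distributed over types within a list. The function name `build_lists` is hypothetical.

```python
# Number of sentences per violation type, as listed in the text.
TYPE_COUNTS = {1: 12, 2: 12, 3: 8, 4: 32, 5: 10, 6: 6, 7: 6, 8: 6, 9: 6, 10: 6}

def build_lists(type_counts):
    """Give every sentence one form in list A and the opposite form in list B."""
    list_a, list_b = [], []
    for vtype, count in type_counts.items():
        for k in range(count):
            sentence_id = (vtype, k)
            # Assumed rule: alternate forms within a type, so each list
            # holds half of every type in each form.
            a_form = "correct" if k % 2 == 0 else "incorrect"
            b_form = "incorrect" if a_form == "correct" else "correct"
            list_a.append((sentence_id, a_form))
            list_b.append((sentence_id, b_form))
    return list_a, list_b

list_a, list_b = build_lists(TYPE_COUNTS)
n_correct_a = sum(1 for _, form in list_a if form == "correct")
print(sum(TYPE_COUNTS.values()), len(list_a), n_correct_a)
```

Because every type has an even number of sentences, this rule yields 52 correct and 52 incorrect sentences per list, matching the design reported in the text.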

The sentences in each list were randomized and 
re-recorded on tape. Subjects listened to the tapes 
via Sennheiser earphones (HD-420).

Procedure 

The subjects were tested individually in a quiet 
room. The experimenter listened to the stimuli 
simultaneously with the subject and stopped the 
tape-recorder at the end of each sentence. The 
subjects were asked to repeat the masked part of 
each sentence, and were encouraged to guess 
whenever necessary. The subjects' responses were 
recorded manually by the experimenter. Subjects 
were randomly assigned to List A or B. 

Results and Discussion

The average percentage of correct responses, 
across subjects and sentence types, was 41.3%.
Overall, correct identification was 67.0% for the 
syntactically correct sentences but only 15.6% for 
the syntactically incorrect sentences. The 
syntactic context effect was evident for each type 
of syntactic violation (Table 1). 
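
As a quick consistency check on these figures: because each subject heard equal numbers of correct and incorrect sentences, the overall rate should equal the mean of the two condition rates. The short sketch below does that arithmetic and also computes the size of the context effect in percentage points; the variable names are illustrative only.

```python
correct_pct = 67.0    # identification rate, syntactically correct sentences
incorrect_pct = 15.6  # identification rate, syntactically incorrect sentences

# Equal numbers of sentences per condition, so the overall rate is the plain mean.
overall = (correct_pct + incorrect_pct) / 2
context_effect = correct_pct - incorrect_pct

print(round(overall, 1), round(context_effect, 1))
```

The mean reproduces the reported overall rate of 41.3%, and the difference between conditions is 51.4 percentage points.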



Table 1. Percentage of correct identification of syntactically correct and syntactically incorrect sentences in each
violation-type category (see text).

These observations were confirmed by two-factor
analyses of variance with subjects and
sentences as random variables. The factors were
syntactic context (correct, incorrect) and syntactic
violation type (Type 1 to 10). The main effect of
syntactic context was significant (F(1,26)=588.44,
MSe=629, p < .0001 for the subject analysis and
F(1,91)=140.93, MSe=624, p < .0001 for the
stimulus analysis). The main effect of sentence
type was also significant (F(9,234)=4.5, MSe=404,
p < .0001 for the subject analysis and
F(9,91)=2.21, MSe=437, p < .03 for the stimulus
analysis). The context effect was conspicuous for
all syntactic violation types, but, as suggested by
an interaction between the two factors, its
magnitude differed. This interaction was
significant for the subject analysis (F(9,234)=4.73,
MSe=322, p < .0001), but only marginal for the
stimulus analysis (F(9,91)=1.86, MSe=624, p <
.07). Tukey-A post hoc analysis of the interaction
revealed that the context effect was greater for
type 3 (gender and number), type 4 (composite
pronoun), type 8 (translocation of interrogative
word), type 6 (separation of subject and attribute),
and type 7 (translocation of negation word) than
for all the other sentence types. Within these two
groups, the context effects were similar.

The results of Experiment 1 demonstrate that
identification of words in sentences is influenced
by the syntactic coherence of the sentence.
Because, across subjects, exactly the same words
were masked and had to be identified in the
syntactically correct and incorrect sentences, the
difference in the correct identification rate
between the two modes of presentation is probably
due to the manipulation of syntactic coherence.
The magnitude of this effect seemed to vary across 
different types of syntactic anomalies but it was 
reliable and statistically significant for each type. 
Because we had neither a priori predictions about 
the effects of particular violation-types on 
identification nor clear post hoc explanations for 
the observed differences, and because the type of 
violation is not directly relevant to the issues 
investigated in this study, the syntactic violation- 
types will be collapsed in all further analyses. 

On the basis of these results, we can use the 
technique of Experiment 1 to assess possible 
differences in the magnitude of the effect in good 
and poor readers. 

EXPERIMENT 2 

Experiment 2 compared the magnitude of the
syntactic context effect on the identification of
auditorily presented words in children who are
good readers and children with a severe reading
disability. As was elaborated in the introduction, 
although several studies reported that poor 
readers are deficient in syntactic comprehension 
(see Vellutino, 1979), others could not find solid 
evidence to support this hypothesis (e.g., Glass &
Perna, 1986). If disabled readers are less aware of
the syntactic structure of the sentence (as part of 
their general linguistic handicap), or do not use 
syntactic information as efficiently as good 
readers, syntactic context effects should be weaker 
in disabled than in good readers. Consequently, 
the effect of syntactic congruity on correct 
identification of sentences should be smaller in 
disabled than in good readers. 

A second prediction concerns the nature of 
errors made by good and disabled readers in the 
identification of words presented in syntactically 
incorrect sentences. The auditory mask probably 
induces some degree of uncertainty in the 
auditory input. If listeners are aware of the
sentence context, they may attempt to use it to 
complement the information that is missing in the 
auditory stream. Such a strategy would cause 
errors in the identification of words that violate 
the syntactical structure because in those 
sentences the target does not conform to the 
expected syntactic rules. Therefore, errors in 
identification that are induced by syntactic 
awareness should reflect the use of correct 
syntactic forms. In an English example, if the 
sentence were "I would like to have many child"
and the word "child" was masked, the subject may
erroneously identify the target as "children." On
the other hand, if the subjects are not aware of the
syntactic structure or not bothered by its
violations, their errors in the identification of
masked words should not be related to the
syntactically correct form of the target. In this
case, the response may be a randomly selected
word, or may relate to the acoustical form, for
example, substituting "mild" for the target "child"
in the above example. If good readers are more
aware of the syntactic structure of the sentence
than disabled readers, the percentages of errors of
the first type ("syntactic corrections") and of the
second type ("random errors") should vary with
reading ability. In an extreme case, we should find
more "syntactic correction" errors than "random"
errors in good readers and vice versa in disabled readers.
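
The two error categories can be made concrete with a small scoring sketch. It is only an illustration of the distinction drawn above, using the English example from the text; the function names and the exact-match rule are assumptions, and the original scoring (done by hand on Hebrew responses) may have been more lenient.

```python
def classify_response(response, target, corrected_form):
    """Classify one response to a masked target in a syntactically incorrect sentence."""
    if response == target:
        return "correct"
    if response == corrected_form:
        return "syntactic correction"
    return "random error"

def error_profile(trials):
    """Percentage of each error type among the erroneous responses only."""
    counts = {"syntactic correction": 0, "random error": 0}
    for response, target, corrected_form in trials:
        label = classify_response(response, target, corrected_form)
        if label != "correct":
            counts[label] += 1
    total = sum(counts.values())
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}

# Masked target "child" in "I would like to have many child".
trials = [
    ("children", "child", "children"),  # syntactic correction
    ("mild", "child", "children"),      # random (acoustically based) error
    ("child", "child", "children"),     # correct identification
]
profile = error_profile(trials)
```

Expressing each error type as a share of all errors, rather than of all trials, keeps the comparison between groups from being driven by differences in overall accuracy.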

Analyses of the errors made by each reading 
group would be a first step towards understanding 
the cause of inter-group differences in syntactic 
context effects, if they exist. However, in order to 
assess syntactic awareness as a metalinguistic
ability rather than automatic use of syntactic
structures for word identification, a more direct
measure has to be employed. In a recent study,
Fowler (1988) compared good and poor readers'
ability to detect and to correct violations of syntax
in orally presented sentences. In that study, the
ability to judge sentences as correct or incorrect in
the "judgment" task was not associated with
reading ability. In contrast, good readers
performed significantly better than poor readers
in the "correction" task. Fowler concluded that
poor readers do not differ from good readers in
syntactic knowledge but that they may be inferior
in manipulating verbal material in short-term
memory (see also Shankweiler & Crain, 1986). We
used Fowler's technique to supplement our study
of syntactic context effects on the identification of
orally presented words. If, as in Fowler's study, a
difference emerges only for the correction
condition, then syntactic awareness is not at fault.
Rather, one would ascribe the differences to 
syntactic processing difficulties that prevent the 
disabled readers from using their syntactic 
knowledge productively. 

Method 
Tests and Materials 

A. Reading Tests. We were interested in 
testing two kinds of reading: the ability to decode 
the phonology from print and the ability to use the 
sentence context in reading without vowels. 
Because all the standard reading tests in Hebrew 
primarily test reading comprehension, we 
constructed two new reading tests for our 
purposes. The first was a test of decoding ability. 
It contained a set of 24 meaningless three- or four- 
letter strings (pseudowords) presented with vowel 
marks. The vowels were chosen according to 
Hebrew morphophonemic rules, and included all 
lawful combinations. Each pseudoword was 
printed individually on a white, 9 cm x 12 cm
cardboard. The size of each letter was 0.5 cm. The 
subject was instructed to read each pseudoword 
exactly as it was written. The accuracy and 
naming onset time were measured. The subject's 
score on this test consisted of the percentage of 
accurately read pseudowords and the mean 
latency of naming onset time. 

The second test was designed to test the ability 
to read Hebrew without vowel marks and 
particularly to use the sentence context to 
determine the reading of unvoweled Hebrew 
words that were both phonologically and
semantically ambiguous. This test contained 48
four- or five-word sentences printed on white
cardboard using the same fonts as for the
pseudowords in the former test. The last word in
each sentence was the target word. In the absence 
of vowel marks, 32 out of the 48 targets were 
phonologically ambiguous, i.e., they could have 
been assigned at least two sets of vowels to form 
two different words. Thus, correct reading of those 
targets could be determined only by apprehending 
the meaning of the sentence. The 32 ambiguous 
targets were 16 pairs of identical letter strings 
each representing a different word in the
respective sentence. Eight of these 16 ambiguous 
targets represented two words of equal frequency. 
The words represented by each of the remaining 
ambiguous words differed in frequency such that
one member of each pair was a high-frequency
word while the other member was a low-frequency 
word. In a previous study, Bentin and Frost (1987)
reported that when undergraduates were
presented with isolated ambiguous words in a
naming task, they tended to choose the most
frequent phonological alternative. We assumed
that, without context, the children would tend to
choose the same. Therefore, insensitivity to the 
context of the sentence should increase the 
number of errors in reading the targets, 
particularly when the correct response requires 
the use of the less frequent phonological 
alternative. The remaining 16 targets were words 
that without the vowel marks could have been 
meaningfully read in only one manner. Eight of 
those sixteen targets were high-frequency words
and the other eight targets were low-frequency 
words. The subjects were instructed to read each 
sentence aloud. The time that elapsed from the 
moment the sentence was exposed until the 
subject finished reading it was measured to the 
nearest millisecond. The score on this test was the 
average percentage of errors and the average time 
to read a sentence. 

In addition to these two special purpose tests, 
each subject was tested for reading comprehension 
by a standard test. The nation-wide average score
on this test for fourth graders is 70% with a SD of 
12%. 

B. Intelligence tests. The IQ of each subject 
was obtained either using the WISC (Full Scale) 
(whenever those data were available) or testing 
the children on the Raven Colored Matrices and 
transforming their performance into IQ scores. 

C. Syntactic awareness test. Syntactic
awareness was assessed by testing identification
of auditorily presented words masked by white
noise as in Experiment 1. On the basis of a pilot
study with children, in order to keep the overall
correct identification of targets around 50%, we
increased the signal-to-noise ratio in Experiment
2 from 1:2.75 to 1:2.25.

Procedure 

Each child was tested individually in three 
sessions. During the first session, reading
performance and intelligence were tested. Reading
performance was recorded on tape for subsequent
error analysis and off-line measuring of time. At
the end of the test of reading without vowels, the
experimenter verified whether the subject knew
the meaning of the targets that had been read
incorrectly. In the very few doubtful cases, the
sentence was excluded and a substitute sentence
of the same type was given. The children who had
been selected for this study (see below) were
invited to a second session during which the
auditory word identification test was given. The
procedures for the word identification test were
identical to those of Experiment 1. In addition, at
the end of the second session, the children were
tested for the ability to repeat from memory the
sentences presented during the auditory test. This
was done by presenting the children with 16
sentences selected from the same pool of sentences
from which the test set was selected. Eight of
these 16 sentences were syntactically correct and
the other eight syntactically incorrect. In the
repetition test the sentences were presented
without the masking noise. Finally, during a third
session (three months later), all 104 sentences
were presented to each child without the masking
noise. Following the presentation of each sentence
the child was asked whether "this is the way it
should be said in Hebrew" (Judgment Task).
Whenever the answer was "no," the child was
asked to correct the sentence (Correction Task).

Subjects 

The good readers were 15 children (7 males) 
selected from a population of fourth graders of two 
elementary schools in Jerusalem. Their ages 
ranged between 8.9 and 9.7 years (mean age 9.3 
years). The average IQ (FS) score (as assessed by 
transforming the Raven score) was 102.5, ranging 
from 85 to 122.5. They were selected to match 
poor readers from the same school on decoding 
ability and IQ. The precise selection criteria will 
be elaborated in Experiment 3. 

The disabled readers were 19 children (12 
males), aged from 9.7 to 14 years (mean age 11.6), 
selected from a population of 32 children with 
severe reading disorders who had been referred 
for special supplementary training in reading. 
They were within the normal intelligence range 
(mean IQ (FS) = 104.83, ranging between 85 and 
130). The disabled readers selected for the present 
study were chosen because they not only showed 
poor decoding ability, as compared to good 
readers, but also performed badly in the test of 
reading without vowels, thus suggesting special 
problems in dealing with context. Table 2 presents 
the reading performance of the good and deficient 
readers as revealed by our reading tests. 

Table 2. Reading performance of the severely disabled 
and good readers. 

            Reading voweled       Reading unvoweled     Reading 
            nonwords              sentences             comprehension 
            % of      Time per    % of      Time per    % 
            errors    item        errors    sentence    correct 

Good 
Readers     8.1%      1.6 sec     4.1%      1.9 sec     81.3% 
Disabled 
Readers     38.8%     16 sec      19.6%     63 sec      55.7% 



Children in both groups were all native 
speakers of Hebrew without known motor, 
sensory, or emotional disorders. All children had 
been tested for normal hearing. 

Results 

The overall percentage of correct identification 
of masked targets was similar in good (44.2%) and 
disabled readers (48.3%) (F(1,32)=2.52, MSe=112, 
p > .12). However, the percentages of syntactically 
correct and incorrect sentences that were correctly 
identified were different in the two groups (Table 
3). Although syntactically correct sentences were 
identified better than syntactically incorrect 
sentences in both groups, the effect of the 
syntactic context was smaller in the disabled 
readers than in the good readers. 

This observation was supported by a mixed- 
model two-factor analysis of variance. The 
between-subjects factor was reading ability, and 
the within-subjects factor was syntactic context 
(correct, incorrect). The syntactic context effect 
was highly significant across groups 
(F(1,32)=784.47, MSe=50, p < .0001). A more 
interesting result, however, was the significant 
interaction that revealed that the syntactic 
context effect was greater in good readers than in 
disabled readers (F(1,32)=11.90, MSe=50, p < 
.002). 
Post hoc analysis revealed that good and 
disabled readers performed equally well with 
correct sentences. However, the percentage of 
correct identifications of words embedded in 
incorrect sentences was higher in disabled than in 
good readers (p < .01). 
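
The syntactic context effect discussed above can be illustrated as a simple difference score. The following is a minimal sketch, not the authors' analysis: the function names are hypothetical, the two group means are the good readers' values as they appear to read in Table 3, and the overall-accuracy helper assumes equal numbers of syntactically correct and incorrect sentences.

```python
def context_effect(pct_correct_context: float, pct_incorrect_context: float) -> float:
    """Syntactic context effect: accuracy for targets in syntactically
    correct sentences minus accuracy for targets in incorrect sentences."""
    return round(pct_correct_context - pct_incorrect_context, 1)


def overall_accuracy(pct_correct_context: float, pct_incorrect_context: float) -> float:
    """Overall accuracy, ASSUMING equal numbers of correct and incorrect sentences."""
    return round((pct_correct_context + pct_incorrect_context) / 2, 1)


# Good readers' group means (percent correct), as read from Table 3.
good_correct, good_incorrect = 71.4, 17.0

print(context_effect(good_correct, good_incorrect))    # size of the context effect
print(overall_accuracy(good_correct, good_incorrect))  # compare with the reported overall figure
```

Under the equal-numbers assumption, the second value agrees with the overall identification rate reported for the good readers (44.2%). A smaller difference score in one group than in the other is what the group by context interaction reflects.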

Table 3. Percentage of correct identification of 
syntactically correct and incorrect sentences in the 
reading disabled and good readers. 

                     Good         Disabled 
                     Readers      Readers 

Syntactically 
Correct 
  Mean               71.4         [illegible] 
  (SEm)              2.2          1.9 

Syntactically 
Incorrect 
  Mean               17.0         [illegible] 
  (SEm)              2.4          2.2 



The errors that children made were distributed 
among four error types: Type 1 errors were 
"syntactical corrections," that is, errors that were 
made in an attempt to use the correct syntactic 
structure of a syntactically incorrect sentence. 
Type 2 errors were "random errors," 
misidentifications that made no sense whatsoever 
or reflected acoustical confusions. Type 3 errors 
were "logical substitutions," that is, substitutions 
of the masked words with other words that gave 
the sentence a logical meaning. Type 4 were "I 
don't know" responses, which were not encouraged 
but were accepted. The percentage of errors of 
each type (out of the total number of responses) in 
each group is presented in Table 4. 
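
The scoring convention described here (each error type expressed as a percentage of all responses, not of errors only) can be sketched as follows. This is an illustration under stated assumptions: the label strings, the function name, and the sample responses are hypothetical and are not data from the study.

```python
from collections import Counter

# Hypothetical labels for the four error types described in the text.
ERROR_TYPES = (
    "syntactic correction",
    "random",
    "logical substitution",
    "don't know",
)


def error_percentages(responses):
    """Return each error type as a percentage of ALL responses,
    correct identifications included in the denominator."""
    counts = Counter(responses)
    total = len(responses)
    return {t: round(100 * counts[t] / total, 1) for t in ERROR_TYPES}


# Made-up coded responses for one child (illustration only).
coded = (
    ["correct"] * 10
    + ["random"] * 4
    + ["logical substitution"] * 4
    + ["don't know"] * 2
)
print(error_percentages(coded))
```

Because the denominator is the total number of responses, the four percentages plus the percentage of correct identifications sum to 100 for each child.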

Table 4. Percentage of errors of each type (out of the 
total number of responses) made by the disabled and 
good readers in the auditory identification test. 

                          Type of error 

                                       Logical        "I don't 
            Corrections   Random       Substitutions  know" 

Reading 
Disabled 
  Mean      3.9           10.2         21.9           15.7 
  (SEm)     0.4           1.5          1.7            2.2 

Good 
Readers 
  Mean      6.2           3.3          17.4           28.8 
  (SEm)     0.8           0.7          1.7            2.8 



Because we had clear predictions only for 
syntactic corrections and random errors, we 
analyzed the distribution of these two error types 
in each group by a mixed-model (reading group X 
error type) analysis of variance. This analysis 
showed that, across the two types of error, the 
good readers made fewer errors than the disabled 
readers (F(1,32)=5.11, MSe=18, p < .031). Across 
groups, the percentage of errors of each type was 
similar (F(1,32)=3.01, MSe=16, p > .09). Most 
interesting, the interaction between reading 
ability and error type was highly significant 
(F(1,32)=21.54, MSe=16, p < .0001). Post hoc 
analysis (Tukey-A) revealed that more random 
errors were made by disabled readers than by 
good readers, whereas syntactic corrections were 
more frequent in good than in disabled readers. 
All the children were able to repeat verbatim all 
sixteen sentences that they heard without the 
masking noise. 

Good readers were better than disabled readers 
on both the judgment and the correction tests. 
Among the disabled readers, however, a secondary 
distinction was evident between four children who 
were 13-14 years old and those who were younger. 
The mean percentage of errors made by each 
group in each task is presented in Table 5. 

Table 5. Percentage of errors made by disabled and 
good readers in the judgment and correction tasks. 

                         Task 

                   Judgment     Correction 

Good 
Readers 
  Mean             1.3          5.4 
  (SEm)            0.4          0.9 

Reading 
Disabled 
  Mean             6.9          31.1 
  (SEm)            1.9          4.3 

Older reading 
Disabled 
  Mean             0.3          5.6 



Because the number of older disabled readers 
was too small to form a reliable independent level 
in a factorial design but, on the other hand, 
clearly formed a distinct group, they were 
excluded from the statistical evaluation. Thus, the 
percentage of errors in each task was compared 
only for good and disabled readers who were more 
similar in chronological age. As before, a mixed- 
model analysis of variance was employed where 
the between-subjects factor was reading group 
and the within-subject factor was the test 
(judgment or correction). The analysis of variance 
showed that good readers made significantly fewer 
errors than disabled readers (F(1,24)=26.58, 
MSe=127, p < .0001) and that more errors were 
made in the correction than in the judgment task 
(F(1,24)=79.25, MSe=35, p < .0001). A significant 
interaction suggested that the task affected the 
percentage of errors made by disabled readers 
more than it affected the good readers 
(F(1,24)=41.0, MSe=35, p < .0001). This 
interaction supports Fowler's results by 
emphasizing the difference between the judgment 
and correction tests. However, in contrast to her 
results, post hoc Tukey-A tests revealed that the 
good readers made significantly fewer errors than 
disabled readers not only in the correction task, 
but also in the judgment task. The inclusion of the 
four older disabled readers in the analysis did not 
change the pattern of results, although these four 
children clearly performed better than the other 
disabled readers. 

Discussion 

For children with a severe reading disability, 
the syntactic context effect on the identification of 
spoken words was smaller than for good readers. 
One explanation for the results might be that 
disabled readers are worse at identifying 
auditorily masked words than good readers 
(Brady, Shankweiler, & Mann, 1983). Such a 
hypothesis, however, is not supported by the 
present data. If the disabled readers in the 
present study had been handicapped in the 
identification of masked words, any manipulation 
that increased the difficulty of identifying the 
words should have had a greater effect on disabled 
than on good readers. Therefore, we should have 
observed a stronger rather than a weaker 
syntactic context effect in disabled readers. 
Further, if masking had a more deleterious effect 
on identification of words by disabled relative to 
good readers, the overall identification 
performance in the disabled group should have 
been lower. In fact, the overall correct 
identification percentage in disabled readers was 
slightly higher than in the good reader group. 

A second account of the results might be that 
the smaller syntactic context effect in poor readers 
reflects a more general problem, such as disorders 
of short-term or working memory. There is indeed 
ample evidence that disabled readers have 
problems with verbal short-term memory (Mann, 
Liberman, & Shankweiler, 1980; for a review see 
Brady, 1986). Therefore, memory disorders might 
explain why their performance is affected by 
sentence context less than that of good readers 
even when decoding difficulties are eliminated; 
they simply do not remember the sentence well 
enough. However, a simply reduced short-term 
memory span cannot easily account for the 
present results because the children in both 
reading groups could accurately repeat sentences 
similar to those used in the identification task 
without any difficulty. It is still possible, however, 
that more complex working memory problems 
could have contributed to the disabled readers' 
pattern of performance. We will return to this 
hypothesis in the General Discussion. 

We are left with the most direct hypothesis that 
inferior syntactic awareness is the reason for the 
relatively poor use of syntactic context by the 
reading disabled children of Experiment 2. This 
possibility is supported by the results of the error 
analysis. The percentage of "syntactic correction" 
errors made by good readers was almost twice as 
great as that made by disabled readers. "Syntactic 
correction" errors could have only been made 
when the subject knew what the correct structure 
of the sentence should have been and expected it. 
In those circumstances, when the physical 
stimulus was degraded, the good readers applied 
syntactic rules and misidentified the target. In the 
same situation, getting only partial information 
from degraded stimuli, disabled readers often 
applied a random guessing strategy, disregarding 
the sentence context completely. Indeed, the 
percentage of "random" errors was three times 
greater in the disabled readers group than in good 
readers. 

An additional question examined in the present 
experiment was whether the disabled readers had 
mastered the correct syntactic structures but did 
not use them properly, or had problems with basic 
syntactic knowledge. We examined this question 
by testing the ability of both groups to detect 
violations of syntactic structure and to correct the 
detected violations. The good readers performed 
better than the disabled readers in both tasks. 
Although the difference between the groups was 
greater for the correction task, disabled readers 
were significantly inferior to good readers in the 
judgment task as well. This latter result 
contradicts the results reported by Fowler (1988) 
and suggests that this group of disabled readers 
was inferior to good readers in their awareness of 
basic syntactic structures. 

The discrepancy between the present results 
and Fowler's (1988) results, as well as the 
disagreement between our conclusion regarding 
the syntactic awareness of disabled readers and 
previous assertions in the literature that basic 
phonological disability and deficient use of 
working memory mechanisms underlie the 
syntactic inferiority observed in poor readers (e.g., 
Shankweiler & Crain, 1986; Shankweiler et al., in 
press), can be explained in two ways. One possible 
explanation is that different types of mechanisms 
underlie reading deficiencies in different 
languages. Recall that we selected our deficient 
reader group to emphasize problems of using 
context while reading without vowel marks. In 
doing so, we may have selected a group of children 
who were poor in syntactic processing. A second 
explanation is that we have examined children 
with a reading disability that was considerably 
more severe than that of the poor readers 
examined by Fowler. Experiment 3 therefore 
attempted to generalize the results of the present 
experiment to poor readers selected, as in Fowler's 
study, from the normal student population of 
regular elementary schools. 

EXPERIMENT 3 

Any attempt to generalize about the 
characteristics of reading disability or to predict 
the performance of children with reading 
disorders is impeded by the heterogeneity of this 
population. Indeed, reading disorders can appear 
as the most conspicuous symptom in children who 
suffer from attentional disorders or general 
learning disability; they can be the main symptom 
(but rarely the only symptom) of developmental 
dyslexia and, at the other extreme, they may 
characterize the performance of otherwise normal 
students who happen to be at the lower end of a 
normal distribution of reading ability. It is 
possible, therefore, that the prior selection of 
different types of reading disorders underlies most 
disagreements about this important handicap 
among educators and scientific investigators. 

In Experiment 2, the reading disabled children 
were selected from a population of children with 
severe reading disorders. Although they were at 
least in the fourth grade, had normal IQs, and 
had no documented neurological symptoms, some 
of those children could hardly read single words 
with or without vowel marks. We found that they 
were inferior to good readers in syntactic 
knowledge and in using syntactic context to help 
identify spoken words. In Experiment 3, our aim 
was to extend these findings to another reader 
group. Thus, we compared good readers with 
children in regular classes who, when formally 
tested, were inferior to good readers in reading 
performance. In particular, we wished to compare 
the good readers with a group of relatively poor 
readers who were equal to the good readers in 
basic decoding ability (as revealed by their 
performance on reading voweled pseudowords) but 
were poorer at reading without vowel marks. We 
assumed that the relatively poor reading 
performance of this group primarily reflects 
inefficient use of the sentence context, and 
expected to be able to measure this relative 
disability by our auditory test of syntactic context 
effects. In addition, the good and poor readers in 
the present experiment were tested for ability to 
detect and to correct syntactic violations. 

Method 

Subjects 

The subjects were 30 children selected from a 
population of 167 fourth graders in two public 
elementary schools. The selection was based on 
performance on the test of decoding ability and 
the test of reading without vowels which were 
described in Experiment 2. Two reading groups 
were assembled. The poor reader group included 
15 children (9 males); their ages ranged between 
8.8 and 9.6 years (mean age 9.1 years). Their 
average IQ (FS) score (as assessed by 
transforming the Raven score) was 102.5, ranging 
from 85 to 122.5. Each of those poor readers 
made no more than four errors (16.6%) in the test 
of decoding voweled pseudowords but at least 
twice as many errors as the good readers while 
reading meaningful sentences without vowels. On 
the basis of the assumption that the relatively 
poor reading performance of those children 
reflected problems with processing of contextual 
information, we will label this group the "poor 
context" group. The good readers were the same 
15 children who were described in Experiment 2. 
Each child in this group was selected to match one 
child in the poor context group on the ability to 
decode voweled pseudowords and on IQ. However, 
each good reader's performance on the unvoweled 
sentences was at least twice as good as that of his 
or her matched subject in the poor context group. 
The average scores of the two reading groups on 
the reading tests are presented in Table 6. 
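
The two screening criteria for the poor context group can be written as a small predicate. This is a hedged sketch only: the function and parameter names are hypothetical, and it assumes that "at least twice as many errors as the good readers" is checked against a single reference error count for good readers, which the text does not spell out. The additional matching on decoding and IQ is not modeled here.

```python
def qualifies_as_poor_context(nonword_errors: int,
                              sentence_errors: float,
                              good_sentence_errors: float,
                              max_nonword_errors: int = 4) -> bool:
    """Apply the two screening criteria described in the text.

    nonword_errors       -- errors in decoding voweled pseudowords
    sentence_errors      -- errors in reading unvoweled sentences
    good_sentence_errors -- reference error count for good readers (assumption)
    max_nonword_errors   -- ceiling of four errors, as stated in the text
    """
    decodes_adequately = nonword_errors <= max_nonword_errors
    weak_on_context = sentence_errors >= 2 * good_sentence_errors
    return decodes_adequately and weak_on_context


# Illustrative values only, not data from the study.
print(qualifies_as_poor_context(3, 10, 4))  # adequate decoding, weak on context
print(qualifies_as_poor_context(6, 10, 4))  # too many decoding errors
print(qualifies_as_poor_context(3, 6, 4))   # sentence errors not doubled
```

Keeping the decoding ceiling and the doubled-error requirement as separate conditions mirrors the logic of the selection: the group is defined by adequate decoding combined with weak use of context.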

Tests and Materials 

The reading tests and the auditory word 
identification test were identical to those 
described in Experiment 2. The IQ scores were 
estimated by testing the children with the Raven 
Colored Matrices test. 
Table 6. Reading performance of the good and poor 
context readers. 

            Reading voweled       Reading unvoweled     Reading 
            nonwords              sentences             comprehension 
            % of        Time per  % of      Time per    % 
            errors      item      errors    sentence    correct 

Good 
Readers     8.1%        1.6 sec   4.1%      1.9 sec     81.3% 
Poor 
Context 
Readers     [illegible] 1.7 sec   18.?%     4.7 sec     71.0% 

Procedure 

The procedure was similar to that employed in 
Experiment 2. The children were tested in two 
sessions. The first session was dedicated to the 
selection of subjects for this study. All 167 fourth- 
graders were tested for reading ability and IQ. 
During the second session only the selected 
children were tested on the auditory word 
identification test, and during a third session 
they were tested on their ability to detect and 
correct the syntactic violations. During the third 
session the sentences were presented without any 
masking noise. Sessions one and two were held 
close to the beginning of the academic year. 
Session three was three months later. 

Results 

The percentage of total correct identifications in 
the "poor context" group (40.4%) was not 
significantly different from that of the good 
readers (44.2%) (F(1,28)=1.03, MSe=209, p > .31). 

The percentages of syntactically correct and 
incorrect sentences that were correctly identified 
in each group are presented in Table 7. 



Table 7. Percentage of correct identification of 
syntactically correct and incorrect sentences in good 
and poor context readers. 

                     Good         Poor Context 
                     Readers      Readers 

Syntactically 
Correct 
  Mean               71.4         [illegible] 
  (SEm)              2.2          3.5 

Syntactically 
Incorrect 
  Mean               17.0         15.4 
  (SEm)              1.4          3.6 



These data were analyzed by a mixed-model 
analysis of variance as in Experiment 2. As before, 
the syntactic context effect was highly significant 
(F(1,28)=505.56, MSe=81, p < .0001). However, in 
contrast to the findings of Experiment 2, the 
interaction between the syntactic context effect 
and reading group was not significant 
(F(l,28)=0.96). 

As in Experiment 2, the errors made by each 
group were categorized into four types. Type 1 
were "syntactical corrections," Type 2 were 
"random errors," Type 3 were "logical 
substitutions," and Type 4 were "I don't know." 
The distribution of errors in each of the two 
reading groups (out of the total number of 
responses) is presented in Table 8. 

Table 8. Percentage of errors of each type made by 
good and poor context readers (out of the total number 
of responses) in the auditory identification task. 

                          Type of error 

                                       Logical        "I don't 
            Corrections   Random       Substitutions  know" 

Good 
Readers 
  Mean      6.2           3.3          17.4           28.8 
  (SEm)     0.8           0.7          1.7            2.8 

Poor Context 
Readers 
  Mean      3.5           6.9          24.?           24.? 
  (SEm)     0.6           1.2          2.9            3.5 



Our a priori predictions concerned only errors of 
Type 1 (correction) and Type 2 (random errors). A 
mixed-model analysis of variance showed no 
significant main effects but a significant 
interaction between the type of error and reading 
group (F(1,28)=10.63, MSe=14, p < .003). Post hoc 
comparisons (Tukey-A) showed that the 
percentage of syntactical correction errors was 
higher in good readers than in the poor context 
group, whereas the percentage of random errors 
was higher in the poor context group than in good 
readers. 

The average percentages of errors in the 
judgment and correction tasks for each group are 
presented in Table 9. 

A mixed-model analysis of variance as in 
Experiment 2 was used to analyze these data. 
Across groups, there were more errors in the 
correction task than in the judgment test 
(F(1,24)=39.16, MSe=16, p < .0001). Overall, the 
good readers made fewer errors than poor context 
readers (F(1,24)=5.24, MSe=25, p < .035). The 
interaction between the test and the group factors 
was significant (F(1,24)=6.32, MSe=16, p < .020). 
Replicating Fowler's (1988) results, post hoc 
analysis revealed that good and poor context 
readers did not differ in the judgment test, 
whereas in the correction test good readers made 
significantly fewer errors than poor context 
readers. 

Table 9. Percentage of errors made by good and poor 
context readers in the judgment and correction tasks. 

                         Task 

                   "Judgment"   "Correction" 

Good 
Readers 
  Mean             1.3          5.4 
  (SEm)            0.4          0.9 

Poor Context 
Readers 
  Mean             1.7          [illegible] 
  (SEm)            0.3          2.3 



All children were able to repeat verbatim all the 
16 sentences presented to them in the absence of 
masking white noise. 



Discussion 

Experiment 3 sought to generalize the results of 
Experiment 2 to groups of relatively poor readers 
selected from the normal distribution of fourth 
graders. Unlike in Experiment 2, the magnitude of 
the syntactic context effect was similar in the good 
and in the poor context readers. 

The syntactic ability of the children in the poor 
context group was not, however, entirely 
equivalent to that of the good readers. Analysis of 
the identification errors made by each group 
revealed that the proportion of errors that 
reflected an attempt to correct the syntactic 
violation was lower in children who had relatively 
more difficulties in reading unvoweled words than 
in good readers who were matched with them for 
phonological decoding ability. In contrast, the 
proportion of misidentifications that reflected 
total ignorance of the sentence's context (either 
syntactic or semantic) was lower in good readers 
than in the poor context group. This pattern of 
errors might suggest that although word 
identification was similarly affected by syntactic 
context in both reading groups, the good readers 
were more aware of the syntactic structure of the 
sentence than were the children in the poor 
context group. The results of the judgment test, 
however, did not support this hypothesis. As it 
turned out, both groups were equally sensitive to 
violations of syntactic structures. It is possible 
though, that part of this result reflected a ceiling 
effect in that task. The groups differed, however, 
in their ability to correct those violations. 

The common aspect of both the "syntactical 
correction" errors and the test of correcting 
syntactic violations is that both measures reflect 
the child's ability to actively generate correct 
syntactic structures. This ability is not required 
by the judgment test and may not be reflected in 
identification performance. Therefore, the present 
data suggest that although the good and the 
relatively poorer readers did not differ in their 
syntactic awareness, that is, in the sensitivity to 
and knowledge of basic syntactic structures, the 
good readers had a superior ability to use their 
syntactic knowledge, and a tendency to do so. 

GENERAL DISCUSSION 
In the present study we examined the relation 
between reading ability and syntactic competence 
as it is reflected in the ability to use syntactic 
context for word identification and to detect and 
correct syntactic violations. In contrast to the 
great majority of studies of context effects in good 
and poor readers, we used auditory rather than 
printed word identification. Auditory presentation 
was used to circumvent a bias that might have 
been induced by the reading disorder itself. Thus, 
we were better able to assess differences in 
syntactic processing ability that might relate to 
reading achievement. The sensitivity of our 
auditory test to syntactic context was verified by 
showing that undergraduate students, fluent 
readers of Hebrew, identified target words masked 
by white noise significantly more accurately if the 
targets were syntactically congruent with the 
sentence in which they appeared than if they 
violated the syntactic structure. 

A syntactic effect similar to that found in 
undergraduates was obtained when the same test 
was given to fourth graders. However, the 
difference between the correct identification of 
syntactically correct and syntactically incorrect 
sentences was smaller in a group of children with 
a severe reading disability than in either good 
readers or relatively poor readers selected from 
the normal distribution of fourth-grade students 
(the poor context group). The good readers and the 
poor context group did not differ in the auditory 
identification test. 

A second difference between the severely 
reading disabled and the children in the poor 
context group was observed in the judgment task. 
Children in the poor context group detected 
sentences that contained an error as well as good 
readers did. In contrast, the reading disabled were 
worse in this test than either the good readers or 
the poor context readers. 

The relative inferiority of the severely disabled 
readers cannot be accounted for only by a simple 
reduction of their short-term memory span. In 
contrast to the complex sentences and complex 
syntactic structures typically used in other 
studies, we used only very short and simple 
sentences (three or four words). When formally 
tested, all the children were able to repeat the 
sentences verbatim without any problem. Holding 
a sentence in working memory for syntactic 
analysis probably requires more mental effort and 
retention of the whole sentence for a longer time 
than required by immediate repetition. As was 
previously reported, the factor of delay influences 
the memory ability of poor readers more than that 
of good readers (Liberman, Shankweiler, 
Liberman, Fowler, & Fisher, 1977). However, 
rather than requiring the manipulation of more 
subtle syntactic aspects, the syntactic violations 
which we have used in the present study were, as 
we have said, straightforward corruptions of the 
basic syntactic relationship between subject and 
predicate or a word order that clearly violated the 
syntactic structure of the sentence. Therefore, we 
agree with Byrne (1981) in doubting that deficient 
use of verbal memory mechanisms by disabled 
readers, at least as this deficiency could be 
revealed by simple repetition, was a major cause 
for the deficient use of syntactic context in the 
present study. Instead, we are inclined to believe 
that the reduced syntactic ability suggested by the 
performance of disabled readers reflected a 
genuine deficiency of linguistic endowment (in the 
syntactic and phonological domains) rather than 
reduced general cognitive ability or poor 
metalinguistic insight. 

Although the syntactic context effect on the 
identification task was equal in the good readers 
and the poor context readers, the syntactic 
competence of these two groups was not entirely 
equivalent. In particular, the good readers made 
significantly more syntactical correction errors 
than the poor context readers. The difference 
between the two groups was even more 
conspicuous in the correction task. Similar to the 
results reported by Fowler (1988) for American 
poor readers, the ability of poor context children to 
correct syntactic violations was significantly 
inferior to that of good readers. This result is 
particularly interesting because, as was noted 
earlier, the syntactic violations used in the 
present study were much simpler and more direct 
than those used by Fowler. Moreover, our samples 
of good and poor context readers were matched for 
their ability to decode and read voweled nonwords. 
Therefore, just as for the disabled readers, the 
difference between the ability of good and poor 
context readers to correct syntactically incorrect 
sentences cannot be easily accounted for only by 
assuming differences between the poor and the 
good readers in general cognitive skills. Rather, 
we suggest that, at least for the specifically 
selected group of poor readers whose reading 
errors reflected reduced ability to analyze 
contextual information, both the correction test 
and the pattern of errors in the identification test 
suggest a specific impairment in the ability to use 
their syntactic knowledge in a productive way. 

Although both the reading disabled and the poor 
context readers are inferior to good readers in 
syntactic competence, these two groups differ from 
one another. In comparison to good readers, the 
disabled readers showed a weaker syntactic 
context effect in the word identification task, an 
inferior ability to detect syntactical aberrations in 
spoken sentences, and an inferior ability to correct 
detected syntactically incorrect sentences. The 
poor context children were equal to good readers 
in the syntactic context effect on word 
identification, were equally able to detect 
syntactic aberrations, but were inferior to good 
readers in the ability to correct the detected 
errors. This pattern of results suggests that the 
different tasks tap different aspects of syntactic 
competence which might develop at different 
rates. 

Some insight into the nature of the syntactic 
disability reflected by the word identification task 
comes from the observation that the significant 
interaction between the reading group and the 
syntactic context effect was not caused by a 
symmetrical effect of reading group on both 
syntactically correct and incorrect sentences. 
Rather, it seems that syntactically correct 
sentences were identified equally well by both 
reading disabled and good readers; however, good 
readers were affected more than reading disabled 
by violations of the syntactic structure of 
sentences. A possible interpretation that is 
supported by these data is that automatic 
syntactic processing was equivalent in both 
groups, but that the disabled readers were less 
aware of the syntactic structure and did not use 
identification strategies that were based upon it. 
A more definite interpretation obviously requires 
a neutral condition in the identification task, 
which was absent in this study. However, these 
results strongly suggest that the identification 
test is sensitive more to strategic differences and 
syntactic awareness than to (automatic) syntactic 
processing. 

It is worth recalling in this connection the 
four older disabled readers: Although they were 
similar to other disabled readers in the auditory 
identification task, their performance in the 
judgment and correction tasks was similar to that 
of good readers. These older children may have 
had a higher level of syntactic competence so that 
when their attention was intentionally directed to 
the structure of the sentence (as was the case in 
the judgment and correction tasks in contrast to 
the identification task), the additional knowledge 
enabled them to use their syntactic knowledge 
productively. 

In conclusion, the results of the present study 
suggest that syntactic factors are directly related 
to reading disabilities, at least in Hebrew. Two 
distinct populations of poor readers have been 
identified. One group was formed of children who, 
for lack of a better term, were labeled reading 
disabled. These children were probably able to use 
basic syntactic structures, as was evident in their 
everyday speech ability and in their identification 
of syntactically correct sentences. However, they 
were not explicitly aware of the syntactic 
structures, and therefore were not inhibited by 
semantic incongruity in the identification test; 
they were less able than good readers to detect 
syntactically incorrect sentences, and they were 
less able to correct those errors that had been 
detected. The second group of poor readers were 
good decoders but were relatively weak in 
analyzing the context of the sentence in reading. 
The performance of these children in the 
identification and judgment tests suggested that they 
were aware of basic syntactic structures and could 
use them for perception of speech. However, they 
were inferior to good readers in using those 
structures productively as suggested by their 
relatively worse performance in the correction 
test. Thus, our data set limits to previous 
assertions that poor reading is not related to 
syntactical impairment (Gleitman & Rozin, 1977; 
Liberman, 1971; Mattingly, 1972; Shankweiler & 
Crain, 1986). 

Of course, we do not claim to have found a 
causal relationship between syntactic ability and 
reading disorders. What we have seen is that, at 
least in Hebrew, there are poor readers of normal 
intelligence who are good decoders. Their 
performance suggests that there are aspects of 
poor reading that are not accounted for by 
deficient phonological processing. Moreover, we 
have shown that this impairment is associated 
with deficiencies in linguistic ability, here 
exemplified in the syntactic domain. 

Effect of Emotional Valence in Infant Expressions upon 
Perceptual Asymmetries in Adult Viewers 



Catherine T. Best† and Heidi Freya Queen†† 



Research on normal and brain-damaged adults indicates that the cerebral hemispheres 
are specialized for emotional as well as cognitive functions. However, controversy remains 
over which pattern of cerebral organization best accounts for emotion perception: overall 
right hemisphere (RH) superiority; RH specialization for negative emotion but left 
hemisphere (LH) specialization for positive emotion; RH specialization for negative 
emotion but no asymmetry for positive; or RH specialization for avoidance-related 
emotions and LH specialization for approach-related ones. Most studies of normal adults 
support overall RH specialization for emotion perception. However, there is some 
suggestion of valence effects in perceptual asymmetries, which may depend on 
engagement of emotional responses in the viewer. The present research examined 
asymmetries in adults' perception of smiling and crying infant expressions because, 
according to ethological theory, infant characteristics elicit heightened emotional 
responses in adults. Results from Experiment 1 supported RH specialization for perception 
of negative expressions, with a lack of asymmetry for positive expressions. Experiments 2 
and 3 investigated whether this negative-valence effect can be attributed to differences in 
the involvement of LH feature-oriented versus RH holistic approaches to basic information 
processing, or rather to some other independent, emotion-related specialization of the 
hemispheres. The results of the latter two experiments were inconsistent with the 
information processing explanation. Discussion concludes with a suggestion that the 
adult's specialized RH sensitivity to infant crying expressions may have evolved in 
response to selection pressures for rapid response to signals that indicate potential threats 
to infant survival. 



Research findings with both unilateral brain- 
damaged patients and normal adults have led to 
general consensus that the human cerebral 
hemispheres are differentially involved in 
emotional, as well as cognitive, processes. 
However, the exact pattern of hemispheric 
involvement in emotions remains controversial. 
According to the most widely-held view, the right 
hemisphere (RH) dominates overall in the 
perception and expression of emotion, across both 
negative and positive emotional valence (e.g., 
Campbell, 1978; Chaurasia & Goswami, 1975; 
Gainotti, 1972, 1988; Hirschman & Safer, 1982; 
Ladavas, Umilta & Ricci-Bitti, 1980; Ley & 
Bryden, 1979, 1981; Safer, 1981; Strauss & 
Moscovitch, 1981). For convenience, that view will 
be referred to here as the RH hypothesis. The 
major counter-proposal has been that the right 
hemisphere predominates in perception and 
expression of negative emotions, the left in 
positive emotions, a view we will refer to as the 
valence hypothesis (e.g., Ahern & Schwartz, 1979; 
Dimond & Farrington, 1977; Natale, Gur & Gur, 
1983; Reuter-Lorenz & Davidson, 1981; Reuter- 
Lorenz, Givis & Moscovitch, 1983; Rossi & 
Rosadini, 1967; Sackeim, Greenberg, Weiman, 
Gur, Hungerbuhler, & Geschwind, 1982; 
Silberman & Weingartner, 1986; Terzian, 1964). 
Several variations on the valence hypothesis have 
also been offered. Some evidence suggests that 
while negative emotions show differential right 
hemisphere involvement, there may be less 
hemispheric asymmetry for positive emotions 
(e.g., Dimond, Farrington & Johnson, 1976; 
Ehrlichman, 1988; Sackeim & Gur, 1978, 1980); 
we will call this view the negative-valence 
hypothesis. Another possibility is that the 
differential involvement of the left and right 
hemispheres in emotions may depend on the 
motivational qualities of approach versus 
avoidance, respectively, rather than on the 
positive versus negative valence of the emotion 
per se (e.g., Kinsbourne, 1978). According to 
Davidson and colleagues, such an approach- 
avoidance distinction between hemispheres 
pertains only to the subject's emotional experience 
(internal feeling-state) and expression (mediated 
by frontal lobes), but not to perception of emotions 
(parietal lobes), which show an overall right 
hemisphere superiority (Davidson, 1984; Davidson 
& Fox, 1982; Davidson, Schwartz, Saron, Bennett, 
& Goleman, 1979; Fox & Davidson, 1986, 1987, 
1988). For brevity, we will call the latter proposal 
the motivational hypothesis. 

This report focuses on asymmetries in normal 
adults for perception of infant facial emotional 
expressions. The majority of findings on 
perception of adult facial expressions by normal, 
neurologically-intact subjects have favored the RH 
hypothesis (e.g., Brody, Goodman, Halm, 
Krinzman & Sebrechts, 1987; Bryden, 1982; 
Bryden & Ley, 1983; Campbell, 1978; Carlson & 
Harris, 1985; Gage & Safer, 1985; Heller & Levy, 
1981; Hirschman & Safer, 1982; Levy, Heller, 
Banich, & Burton, 1983; Ley & Bryden, 1979, 
1981; Moscovitch, 1983; Safer, 1981; Segalowitz, 
1986; Strauss & Moscovitch, 1981). These studies 
have typically found a left visual field (LVF) 
advantage, implying RH superiority, in perception 
of both positive and negative emotional 
expressions. 

Only a few perceptual asymmetry studies have 
supported the valence hypothesis or its variants. 
In favor of the valence hypothesis, adults rate 
tachistoscopically-presented facial expressions 
more negatively when they are presented to the 
LVF-RH, but rate them more positively in the 
right visual field (RVF)-left hemisphere (LH), 
although the RH is better overall at 
differentiating among categories of emotion 
(Natale et al., 1983). Similarly, when subjects 
must identify the visual field containing an 
emotional expression during simultaneous 
tachistoscopic presentations of an emotional 
expression in one visual field and a neutral 
expression in the opposite visual field, they detect 
negative expressions more rapidly in the LVF-RH 
but detect positive expressions more rapidly in the 
RVF-LH (Reuter-Lorenz & Davidson, 1981; 
Reuter-Lorenz et al., 1983). The motivational 
hypothesis is supported by research on EEG 
responses in subjects viewing films of negative 
versus positive emotional expressions. Both adults 
(Davidson et al., 1979) and infants (Davidson & 
Fox, 1982) showed greater electrocortical 
activation of the right frontal lobe while viewing 
emotionally negative films, but greater activation 
of the left frontal lobe during positive films. 
However, parietal lobe activation was greater on 
the right than the left side at both ages for both 
types of film. No studies on perception of facial 
expressions support the negative-valence 
hypothesis. However, when emotionally negative 
films have been presented to a single hemisphere 
by having subjects wear half-silvered contact 
lenses (Dimond et al., 1976), or when emotionally 
negative odorants have been restricted to a single 
hemisphere via presentation to the ipsilateral 
nostril (Ehrlichman, 1988), subjects rated the 
stimuli presented to the RH as more intensely 
negative, without showing significant 
asymmetries for rating emotionally positive 
stimuli. 

Why are there such inconsistent findings across 
studies of cerebral asymmetries in normal adults' 
perception of facial expressions? To some extent, 
they may be explained by variations in 
methodology and task requirements. Studies 
supporting the RH hypothesis have typically 
required subjects to recognize or discriminate 
facial expressions. In contrast, the studies 
favoring variants of the valence hypothesis have 
called for judgments about the emotionality of the 
stimuli. Recognition and discrimination 
judgments call upon basic cognitive-perceptual 
skills, and may be carried out by so-called "cold" 
cognitive abilities, whereas judgments about 
emotionality may more directly require the viewer 
to tap into emotional processes. This observation 
raises the related possibility that the presence or 
absence of a valence effect in perception may 
depend on the viewer's own emotional response to 
the stimuli, as Davidson (1984) and Ehrlichman 
(1988) have suggested. The viewer's emotional 
response, in turn, could very likely be influenced 
by whether the emotions represented in the 
stimulus photographs are genuine or spontaneous 
versus simulated or posed. Clinical evidence 
indicates that productions of spontaneous and 
posed expressions are mediated by different 
neural pathways; the former is disturbed by 
damage to the temporal lobes or extrapyramidal 
system, while the latter is disturbed by frontal or 
pyramidal damage (e.g., Monrad-Krohn, 1924; 
Remillard, Anderman, Rhi-Sausi & Robbins, 1977; 
Rinn, 1984). These two classes of expression could 
therefore be expected to provide different 
information to the viewer about the emitter's 
internal emotional state. Perceivers would 
presumably be more likely to have an emotional 
response themselves while viewing genuine 
emotional expressions than they would while 
viewing simulated expressions. In this context, it 
is important to note that nearly all the data on 
asymmetries in perception of facial expressions 
have been obtained with stimuli containing posed 
rather than spontaneous emotional expressions. 

For these reasons, we conducted a series of 
experiments requiring judgments about the 
emotionality of facial expression stimuli that 
should be more likely to elicit emotional responses 
in the viewer. Photographs of smiling and crying 
infants were chosen as the stimulus materials, 
based on several considerations. First, infant 
expressions are certainly more spontaneous and 
genuine, thus providing a more direct window on 
the infant's actual affective state, than are most 
adult expressions, especially those facial 
expressions that occur in social situations. Even in 
the case of so-called spontaneous expressions in 
adults, the facial display is often influenced to 
some extent by the forces of social conditioning 
and cultural display rules (Buck, 1986; Ekman, 
1972), which would have much less or no influence 
on infants' expressions (e.g., Campos, Barrett, 
Lamb, Goldsmith, & Stenberg, 1983; Rothbart & 
Posner, 1986). It is generally assumed that infants 
do not begin to simulate, mask, or deliberately 
control their facial expressions until the second 
year of life (e.g., Campos et al., 1983; Oster & 
Ekman, 1978; Rothbart & Posner, 1985; Sroufe, 
1979; cf. Fox & Davidson, 1988). Second, whereas 
adult expressions often involve complex mixtures 
of emotions, infant expressions tend to be simpler, 
reflecting purer examples of the basic categories of 
emotion (Campos et al., 1983; Izard, 1979; Izard, 
Huebner, Risser, McGinnes, & Dougherty, 1980). 
This characteristic of infant expressions may elicit 
simpler, more straightforward emotional 
responses from viewers than do more complex 
adult expressions. Third, ethological theory and 
research argue that infant expressions and 
appearance strongly tend to elicit emotional 
responses from adults, and do so to a greater 
extent than do the expressions and appearance of 
(unfamiliar) adults (e.g., Bowlby, 1969; Eibl- 
Eibesfeldt, 1975; Lorenz, 1935, 1981; Lorenz & 
Leyhausen, 1973). Adults' emotional responses to 
infant signals, including infant emotional 
expressions, are part of a mutually adapted 
system of evolved behaviors that promote the 
development of the relatively helpless human 
infant, which thus fosters the reproductive fitness 
of individuals displaying these characteristics, 
and ultimately the survival of the species. These 
responses to infants are, of course, particularly 
strong in their caregivers, but are present in all 
humans. 

Infant crying and smiling are of particular 
interest to the present research when considered 
in ethological terms. Both serve to promote 
physical proximity between the infant and the 
caregiver, although for different reasons (e.g., 
Bowlby, 1969; Campos et al., 1983; Emde, 
Gaensbauer & Harmon, 1976). Smiling indicates a 
positive affective state and emotional approach of 
the infant toward the adult with whom it is 
interacting. Infant smiling typically also elicits 
positive feelings and a corresponding approach 
response from that adult. The motivational 
tendencies associated with infant crying, however, 
differ for the infant and the responding adult. The 
infant's distress indicates negative feelings 
associated with some noxious stimulation or 
situation, and therefore a tendency for withdrawal 
in the infant. However, given an infant who 
cannot actively, physically withdraw itself, the 
crying typically elicits negative or distress feelings 
in nearby adults, which usually leads them 
(particularly caregivers) to attempt to aid the 
infant by eliminating the source of the distress. 
Thus, approach behavior is elicited in the adult. 

Based on this analysis of adults' responses to 
infant smiles and cries, the four hypotheses 
outlined above regarding cerebral organization for 
emotional processes would each predict a different 
pattern of asymmetries in adults' perception of 
infant emotional expressions. The RH hypothesis 
would predict an overall LVF advantage that is 
unaffected by the valence of the emotion 
expression. The valence hypothesis would predict 
a LVF-RH advantage for perception of infant 
crying expressions, but a RVF-LH advantage for 
perception of infant smiles. The negative-valence 
hypothesis would instead predict a LVF-RH 
advantage for crying expressions but a smaller or 
nonexistent bias for perception of smiles. Finally, 
the motivational hypothesis would most likely 
predict a RVF-LH advantage for perception of 
both infant cries and smiles, because both should 
elicit approach responses in adult viewers. The 
RVF-LH bias might be larger for smiles than for 
cries, to the extent that smiles might elicit a 
stronger approach tendency when the adult 
viewers are not caregivers of the infants depicted. 
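
The four sets of predictions can be laid out side by side in a small lookup table. The sketch below is only an illustrative summary, not part of the original study: the names `PREDICTIONS` and `consistent_hypotheses` are hypothetical, and the coding is deliberately coarse. In particular, the negative-valence hypothesis is coded as predicting no bias for smiles, although the text also allows a smaller LVF bias, and differences in the size of a bias are ignored.

```python
# Predicted direction of visual field bias for each hypothesis.
# "LVF" = left visual field (right hemisphere) advantage,
# "RVF" = right visual field (left hemisphere) advantage,
# "none" = no reliable asymmetry (a simplification, see the lead-in).
PREDICTIONS = {
    "RH": {"cry": "LVF", "smile": "LVF"},
    "valence": {"cry": "LVF", "smile": "RVF"},
    "negative-valence": {"cry": "LVF", "smile": "none"},
    "motivational": {"cry": "RVF", "smile": "RVF"},
}


def consistent_hypotheses(observed):
    """Return the hypotheses whose predicted directions match `observed`.

    `observed` maps "cry" and "smile" to "LVF", "RVF", or "none".
    """
    return [name for name, predicted in PREDICTIONS.items()
            if predicted == observed]


# Hypothetical outcome, used only to show how the table is read.
print(consistent_hypotheses({"cry": "LVF", "smile": "RVF"}))
```

Because each hypothesis maps onto a distinct pair of directions under this coding, any observed pair of biases matches at most one of them.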

Experiment 1 investigated these possibilities, 
using a set of photographs of infants' smiling and 
crying expressions. For this purpose, we employed 
the free-field viewing procedure developed by 
Levy, Heller, Banich and Burton (1983), in which 
subjects must choose which member of a pair of 
mixed-expression chimeras for each of a number of 
posers appears to be emotionally more intense 
(happier or sadder). Each pair of chimeras 
displays a half-neutral, half-emotional facial 
expression of a given poser; in one chimera, the 
emotional expression appears on the left side of 
the photograph, while in the other it appears on 
the right side. Significant asymmetrical biases in 
the viewers' choices between these pairs of 
chimeras are interpreted as reflecting 
asymmetrical activation of the cerebral 
hemispheres in response to the task, following 
Kinsbourne's (1978) proposal that activation of 
one cerebral hemisphere will cause an attentional 
bias (increased perceptual sensitivity) favoring the 
contralateral spatial hemifield. Thus, if chimeras 
with the emotional expression on the left side of 
the photograph are perceived to be more 
emotional (happier or sadder) than those with the 
emotional expression on the right side, a LVF bias 
would be indicated, implying greater activation of 
the RH during the emotion judgment task. The 
converse pattern would indicate a RVF-LH bias 
(cf. Grega, Sackeim, Sanchez, Cohen & Hough, 
1988). Levy et al. (1983) found a LVF-RH bias in 
adults' perception of half-neutral, half-smiling 
chimeras of adult posers' faces. 

Experiments 2 and 3 were designed to examine 
whether the valence effect that was found in 
Experiment 1 could be attributed to differences in 
the extent to which judgments about smiling 
versus crying expressions involve the basic 
information processing approaches of the right 
versus the left hemisphere. The left hemisphere 
has been characterized as having a feature- 
oriented, analytical approach to information 
processing, whereas the right hemisphere 
approach has been described as holistic, gestalt- 
like, or synthetic (e.g., Bradshaw & Nettleton, 
1981; Bryden, 1982; Levy, 1974). An effect of 
valence on asymmetries in the perception of 
emotional expressions might result from greater 
involvement of LH feature-oriented processing for 
one category of expression than for the other (cf. 
Moscovitch, 1983). For example, smiles may be 
perceived by simply focusing on whether the 
corners of the mouth are upturned or not, but 
perception of crying expressions may require the 
perceiver to attend to the overall configuration of 
mouth, eyes and brows. If so, then manipulating 
the stimulus properties to emphasize a focus on 
specific features should shift the degree and 
direction of visual field bias for judgments about 
negative and positive expressions. Alternatively, 
the valence effect could reflect differences in 
hemispheric specialization for positive and 
negative emotions per se, independent of the basic 
information processing skills of the two 
hemispheres. In the latter case, feature-oriented 
stimulus manipulations would not be expected to 
influence the valence effect on perceptual 
asymmetries. These possibilities were explored in 
the last two experiments reported here. 

EXPERIMENT 1 
Method 

Subjects 

Forty-six university students (23 female, 23 
male) were included in this study; all had also 
participated in a related study of asymmetries in 
infants' facial expressions (Best & Queen, 1989). 
All were familial right-handers, a population that 
is more consistently and more strongly later- 
alized than non-right-handers for various 
hemispherically-specialized functions, including 
the perception of emotional expressions 
(Chaurasia & Goswami, 1975; Heller & Levy, 
1981). Subjects completed a handedness checklist 
that assessed degree of hand preference on 10 
common unimanual activities, including writing, 
as well as for the writing hand preference of 
immediate family members. To be considered 
strongly right-handed, subjects had to indicate a 
"strong" to "moderate" right-hand preference for 
all items, without ever switching hand preference 
during childhood, and both of their parents had to 
be right-handed. Four additional subjects were 
also tested but later eliminated for failure to meet 
the handedness criteria. All subjects had normal 
or corrected vision. They received $4.00 for their 
participation in the 40 minute test session. 

Stimuli 

The stimulus materials were generated from 
photographs of facial expressions by 10 normal, 
full-term 7- to IS-month-old infants, which were 
originally taken by a portrait photographer for a 
series of studies on infant attractiveness 
(Hildebrandt & Fitzgerald, 1978, 1979, 1981). 
They were the same original photographs as those 
used in the Best and Queen (1989) study. Each 
infant provided a neutral facial expression and 
either a clearly negative (i.e., crying) or a clearly 
positive (i.e., smiling) expression, according to 
ratings obtained in an independent study 
(Hildebrandt, 1983). Four infants had crying 
expressions; the other 6 had smiling expressions. 
All photographs were of full-frontal facial views. 

Two black-and-white 6x7 inch prints were 
made of each infant's neutral expression, along 
with two prints of the infant's smiling or crying 
expression. For each pair, one photograph was 
printed in normal orientation, and the other in 
mirror-reversed orientation. These were used to 
construct four mixed-expression chimeras for each 
infant (see Heller & Levy, 1981). Each print was 
cut down the exact facial midline, defined as the 
line connecting the point midway between the 
internal canthi of the eyes and the point in the 
center of the philtrum just above the upper lip. 
The two normal orientation chimeras for a given 
infant were made by joining the left half of the 
normal orientation print of the emotional 
expression (smile or cry) with the right half of the 
normal orientation neutral expression, and by 
joining the right half of the emotional expression 
with the left half of the neutral expression. For 
each chimera, the midlines of the hemifaces were 
aligned at the eyes and nose (the mouths often 
could not be exactly aligned because of differing 
degrees of mouth opening; see also Heller & Levy, 
1981), and glued to a backing sheet. The mirror- 
reversed chimeras were constructed in like 
manner from the mirror-reversed prints of the 
emotional and neutral expressions. Comparison of 
results with the normal orientation chimeras 
versus the mirror-reversed chimeras allowed for 
assessment of any influence that infant expressive 
asymmetries might have upon the adults' 
responses. 
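
The construction of the four chimeras per infant can be sketched in a few lines. This is only an illustration of the logic, not the authors' procedure, which was carried out with cut and glued photographic prints. It assumes each image is a list of pixel rows and that the facial midline is a single column index supplied by the caller; the function names are hypothetical.

```python
def mirror(image):
    """Return the mirror-reversed (left-right flipped) image."""
    return [row[::-1] for row in image]


def join_halves(left_source, right_source, midline):
    """Join the left half of one image to the right half of another."""
    return [left[:midline] + right[midline:]
            for left, right in zip(left_source, right_source)]


def make_chimera_pair(emotional, neutral, midline):
    """Build the two chimeras for one orientation of the prints."""
    return {
        "emotion_left": join_halves(emotional, neutral, midline),
        "emotion_right": join_halves(neutral, emotional, midline),
    }


def make_all_chimeras(emotional, neutral, midline):
    """Build the normal and mirror-reversed chimera pairs for one infant."""
    width = len(emotional[0])
    return {
        "normal": make_chimera_pair(emotional, neutral, midline),
        # After flipping, the midline sits at the mirrored column.
        "mirror_reversed": make_chimera_pair(
            mirror(emotional), mirror(neutral), width - midline),
    }


# Tiny one-row "images": E = emotional print, N = neutral print.
emotional = [["E1", "E2", "E3", "E4"]]
neutral = [["N1", "N2", "N3", "N4"]]
chimeras = make_all_chimeras(emotional, neutral, midline=2)
print(chimeras["normal"]["emotion_left"])
print(chimeras["mirror_reversed"]["emotion_left"])
```

In the mirror-reversed pair, the half that appears on the left of the chimera comes from the opposite hemiface of the infant, which is what allows the infants' own expressive asymmetries to be separated from the viewers' perceptual bias.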

Before reproduction, each chimera was centered 
behind an oval-shaped mattboard opening the size 
of the average photographed face, in order to 
screen out variations in facial outline and hair 
among the infants. Copies were made on a high- 
quality Kodak photocopier, using a gray-scale 
correction template that produces good resolution 
of photographic images. Each page contained 
either the pair of normal orientation chimeras for 
a given infant, or the pair of mirror-reversed 
chimeras for an infant, appearing one above the 
other. We used above-below rather than side-by- 
side pairings to avoid having subjects' choices of 
the more emotional chimera be confounded by 
hemispatial field biases resulting from the 
asymmetrical hemispheric activation that would 
be expected for such a laterally-specialized 
function (Kinsbourne, 1978). Each of the 10 
infants was represented on four pages: 1) a 
normal-orientation pair in which the emotional 
expression was on the right side in the top picture; 
2) a normal-orientation pair in which the 
emotional expression was on the left side in the 
top picture; 3) and 4) the pairs of mirror-reversed 
chimeras positioned as in items 1 and 2, 
respectively. Thus, there were 40 pages of paired 
chimeras. Test booklets were constructed with 
these pages ordered pseudorandomly, such that 
there were no more than 3 consecutive smiling 
infants or 3 consecutive crying infants, and no 
consecutive presentations of the same poser. At 
the top of each page one of the following questions 
was printed: "Which infant looks happier?" (for 
smiling-neutral chimeras) or "Which infant looks 
sadder?" (for crying-neutral chimeras). 
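
A page order satisfying these constraints can be generated in several ways; the report does not say how the booklets were actually ordered. The sketch below is one possible approach, assuming a randomized greedy construction that restarts whenever it reaches a dead end. All function and parameter names are hypothetical.

```python
import random


def allowed(order, page, max_run):
    """Check whether `page` may follow the pages already placed in `order`."""
    poser, emotion, _ = page
    if order and order[-1][0] == poser:
        return False  # same poser on consecutive pages
    tail = order[-max_run:]
    if len(tail) == max_run and all(p[1] == emotion for p in tail):
        return False  # would exceed max_run consecutive pages of one emotion
    return True


def make_booklet_order(n_smile=6, n_cry=4, pages_per_infant=4,
                       max_run=3, seed=None, max_tries=10000):
    """Return a pseudorandom page order as (poser, emotion, page_type) tuples."""
    rng = random.Random(seed)
    pages = [(f"{emotion}{i}", emotion, page_type)
             for emotion, n in (("smile", n_smile), ("cry", n_cry))
             for i in range(n)
             for page_type in range(1, pages_per_infant + 1)]
    for _ in range(max_tries):
        remaining, order = pages[:], []
        while remaining:
            candidates = [p for p in remaining if allowed(order, p, max_run)]
            if not candidates:
                break  # dead end: discard this attempt and start over
            choice = rng.choice(candidates)
            order.append(choice)
            remaining.remove(choice)
        if not remaining:
            return order
    raise RuntimeError("no valid page order found")


booklet = make_booklet_order(seed=1)
print(len(booklet))
```

With the default arguments the function produces the 40 pages described above (10 infants by 4 page types), with no more than 3 consecutive pages of the same emotion and no poser appearing on consecutive pages.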

Procedure 

Subjects were tested in groups of 5-15 in a quiet 
room. They sat at separate desks and each had a 
copy of the test booklet, with a cover sheet of 
instructions. They circled "TOP" or "BOTTOM" on 
each numbered line (1-40) of an answer sheet to 
indicate which member of each pair of chimeras 
looked happier or sadder. Subjects proceeded 
through the booklet at their own pace, one page at 
a time without turning back or directly comparing 
pages. 

Results 

The data were converted to laterality ratios 
according to the formula (R-L)/(R+L), in which R = 
percent of chimera choices with the emotional 
expression on the right side of the picture (i.e., 
RVF preference), L = percent choices with the 
emotional expression on the left side (LVF 
preference), and R+L = 100%.¹ The laterality 
ratios thus range from -1.0 (extreme LVF bias) to 
+1.0 (extreme RVF bias). These values were then 
entered into a 2 x 2 analysis of variance (ANOVA), 
with the factors of emotion (cry, smile) and 
orientation of the photographs used in the 
chimera (normal, mirror-reversed). To determine 
whether the mean laterality ratios for each cell, 
and overall, showed a significant perceptual 
asymmetry (i.e., differed significantly from a score 
of 0 laterality), t-tests were conducted. The alpha 
level correction for multiple t-tests (α = 
.05/number of t-tests) required a significance level 
of p < .007 for the latter tests. 
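
The scoring steps in this paragraph can be illustrated with a short sketch. It is not the authors' analysis code: the function names are hypothetical, the input is assumed to be raw counts of choices (which give the same ratio as percentages when R+L = 100%), and the sample values are invented for illustration. The t statistic is the standard one-sample form, and the alpha correction is the simple division described in the text.

```python
import math
import statistics


def laterality_ratio(right_choices, left_choices):
    """(R - L) / (R + L): -1.0 is an extreme LVF bias, +1.0 an extreme RVF bias."""
    total = right_choices + left_choices
    if total == 0:
        raise ValueError("no choices recorded")
    return (right_choices - left_choices) / total


def one_sample_t(values, mu=0.0):
    """One-sample t statistic for the mean of `values` against `mu`."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return (mean - mu) / (sd / math.sqrt(n))


def corrected_alpha(alpha, n_tests):
    """Per-test significance level: alpha divided by the number of tests."""
    return alpha / n_tests


# Invented example: a subject chooses the left-emotion chimera on 30 of 40 pages.
print(laterality_ratio(right_choices=10, left_choices=30))

# Invented ratios for three subjects, tested against a ratio of 0.
print(one_sample_t([-0.5, -0.3, -0.1]))
```

A negative ratio, as in the first example, corresponds to a LVF bias, and a mean ratio is treated as a significant asymmetry only if the p value of its t statistic falls below the corrected alpha.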

There was a significant LVF bias overall (see 
Table 1 for all mean laterality ratios and t-tests). 

Table 1. Laterality ratios for the perception of mixed-expression 
chimeras of smiling and crying infants in normal and mirror-reversed 
orientation, Experiment 1. 

Effects                             Laterality ratio(a)   t-value(b)        p 
Overall perceptual bias             [illegible]           -4.78             [illegible] 
Emotion effect 
  Smile                             [illegible]           -2.57             [illegible] 
  Cry                               -.19                  [illegible]       .0000 
Orientation effect 
  Normal orientation                -.57                  -17.34            .0000 
  Mirror-reversed                   +.31                  [illegible]       .0000 
Emotion x Orientation interaction 
  Smile 
    Normal orientation              -.72                  -22.07            .0000 
    Mirror-reversed                 +.59                  15.[illegible]    .0000 
  Cry 
    Normal orientation              [illegible]           -7.[illegible]    .0000 
    Mirror-reversed                 +.03                  [illegible]       [illegible] 

(a) Computed as [R-L]/[R+L], where R = percent choices with 
emotional expression on right of chimera, L = percent 
choices with emotional expression on left, and R + L = 
100%. Negative scores indicate a left visual field bias, 
positive scores a right field bias. 

(b) One-sample t-tests of whether the mean laterality ratio was 
significantly greater than 0, indicating a significant 
perceptual asymmetry. 

However, a significant main effect of emotion, 
F(1,45) = 10.09, p < .003, indicated that the 
valence of the infant expressions influenced the 
degree of asymmetry in the adults' perception of 
the emotionality of the chimeras. Specifically, the 
LVF bias was significant for judgments of crying 
infants, but not for smiling infants. In addition, 
the orientation effect, F(1,45) = 366.68, p = .0000, 
revealed that the adult subjects' perceptual biases 
were quite sensitive to asymmetries in the infants' 
expressions themselves. The normal orientation 
chimeras, in which the infants' more expressive 
right hemiface (Best & Queen, 1989) appeared in 
the viewers' LVF, yielded a significant LVF bias. 
In contrast, the mirror-reversed chimeras, in 
which the infants' right hemiface appeared in the 
RVF, yielded a smaller but significant RVF bias. 
Finally, there was a significant Emotion x 
Orientation interaction, F(1,45) = 66.80, p = .0000 
(see Table 1). For the smiling infants, there was a 
large difference in laterality ratios between the 
normal orientation chimeras, which showed a 
strong LVF bias, and the mirror-reversed 
chimeras, which showed a strong RVF bias. The 
viewers' perceptual asymmetries were less 
strongly influenced by orientation of the crying 
infant photos, showing a more moderate LVF bias 
for normal orientation chimeras and a very small, 
nonsignificant RVF bias for mirror-reversed 
chimeras. Simple effects tests of the interaction 
indicated that the orientation effect was 
significant, nonetheless, for both the crying 
expressions, F(1,45) = 28.25, p = .0000, and the 
smiling expressions, F(1,45) = 703.32, p = .0000. 
Furthermore, the emotion effect was significant 
for both normal orientation chimeras, F(1,45) = 
24.14, p = .0000, and mirror-reversed chimeras, 
F(1,45) = 63.29, p = .0000. 

DISCUSSION 

Consistent with the hypothesis that cerebral 
asymmetries in emotional processes are 
influenced by emotional valence, the viewers in 
Experiment 1 showed a significant LVF bias in 
perception of negative but not positive infant 
emotional expressions. These results are 
compatible with the negative-valence hypothesis 
of cerebral organization for emotional processes 
(e.g., Ehrlichman, 1988). They do not as strongly 
support the valence hypothesis (e.g., Silberman & 
Weingartner, 1986; Tucker, 1981), because the 
LVF-RH bias in perception of negative infant 
expressions was not complemented by a RVF-LH 
bias in perception of positive expressions. The 
results also fail to support the motivational 
hypothesis of emotion lateralization (e.g., 
Davidson, 1984) in that infant cries and smiles did 
not yield a RVF-LH advantage, although both 
types of infant expression would be expected to 
elicit approach responses from adult viewers. The 
results also stand in contrast to the overwhelming 
majority of studies favoring the RH hypothesis, 
which have found a LVF advantage in adults' 
perception of adult facial expressions that is 
unaffected by valence (e.g., Bryden, 1982; Bryden 
& Ley, 1983). Only three studies on the perception 
of adult facial expressions have found a valence 
effect in neurologically intact adults, and these 
all involved tachistoscopic presentations 
(Natale, Gur, & Gur, 1983; Reuter-Lorenz & 
Davidson, 1981; Reuter-Lorenz, Givis & 
Moscovitch, 1983). Thus it is remarkable that the 
perception of infant expressions showed a 
consistent valence effect even with the free-field 
task used here. 

It was suggested earlier that a valence effect 
should be optimized by the heightened emotional 
response that infant expressions elicit from adults 
according to ethological theory. Indeed, informal 
observations revealed that many of the subjects 
smiled or showed other positive emotional 
responses to the infant faces during the test, 
whereas none of them showed such responses 
while completing a similar test with chimeric 
adult expressions during the same session. The 
suggestion of heightened sensitivity to infant 
emotional expressions is further corroborated by 
the finding that the perceptual responses of our 
adult subjects were strongly influenced by the 
infants' own expressive asymmetries. In contrast, 
Levy et al. (1983) failed to find significant 
evidence of perceivers' sensitivity to the 
expressive asymmetries of their adult posers in a 
similar free-field study using adult chimeric 
expressions.² In the present experiment, the 
viewers' sensitivity to asymmetrical information 
in the infant expressions was great enough to 
reverse their perceptual field asymmetries from a 
LVF bias when the more expressive right 
hemiface of the infants was on the left side of the 
chimera, to a smaller but significant RVF bias 
when the infants' right hemiface was on the right 
side of the chimera. 

The interaction between the valence of the 
infants' emotional expressions and the orientation 
of the chimeras indicates that the relative impact 
of the viewers' perceptual biases and the infants' 
expressive asymmetries differed between 
judgments of negative and positive expressions. 
The pattern of this interaction provides additional 
support for the negative-valence effect. Although 
the left/right position of the infants' more 
expressive right hemiface within the chimeras 
influenced perception of both types of expressions, 
it was a much weaker determinant of judgments 
about cries than about smiles. That is, infant
expressive asymmetries had relatively less
influence, and the viewers' LVF-RH bias relatively
more influence, in the adults' responses to cries
than to smiles.

Thus, Experiment 1 provided clear evidence of a 
negative-valence effect on asymmetries in adults' 
perception of infant emotional expressions. 
However, it did not elucidate the underlying
perceptual processes responsible for the
phenomenon. As suggested in the introduction, 
one possible source of the effect might be that 
negative expressions are perceived in terms of the 
configuration of the whole face (i.e., the gestalt of 
the features within the "frame" of a face outline
and hair), whereas perception of positive 
expressions may focus upon the mouth as a single 
distinguishing feature (Moscovitch, 1983). The 
former approach would call more heavily upon the 
holistic abilities of the right hemisphere, while the
latter approach would be more suited to the 
feature-oriented analytic abilities of the left 
hemisphere (e.g., Bradshaw & Nettleton, 1981; 
Levy, 1974). If the influence of emotional valence 
is attributable to such differences in the
perceptual approach to crying and smiling 
expressions, then the negative-valence effect, and 
indeed the overall LVF bias, should become 
attenuated when the viewers' attention is 
progressively restricted to narrower sources of 
emotional information in the infants' faces, such 
as the patterning of the central facial features 
without the contextual "frame" of the facial
outline, hair, and other peripheral details. This 
manipulation should lead subjects to use a more 
feature-oriented, analytic approach, and thus to 
rely more heavily on left hemisphere information 
processing strategies. Alternatively, it may be that
the viewers' actual emotional responses to crying
and smiling infants, rather than the information
processing strategy, are responsible for the
valence effect. If so, then the negative-valence
effect and the overall perceptual field bias should 
appear even when the viewer's attention is 
focused away from the gestalt of the faces and 
toward subcomponents or specific features. The 
next two experiments were designed to 
systematically examine these possibilities. 

EXPERIMENT 2 
If the holistic or gestaltlike perceptual 
specialization attributed to the right hemisphere 
is responsible for the finding of a LVF bias for crying
but not smiling expressions, we would expect that 
perception of cries focuses on the gestalt of the 
whole face, i.e., the patterning of facial features 
within the configurational "frame" of the face. If
the crying expressions evoked a greater degree of 
right hemisphere involvement because they were
perceived in a more holistic manner, then removal
of the peripheral configurational information such
as the facial outline should attenuate or eliminate
the negative-valence effect. If, on the other hand,
the valence effect derives from the emotional
nature of the response to the infant expressions,
then it should persist even for judgments of the 
central facial features alone. 

To restrict the viewers' attention to the 
patterning of the central features of the 
eyes/brows, mouth and nose, we deleted the 
unwanted peripheral details (e.g., face outline, 
hair, cheeks) from optically-digitized versions of
the original photographs via computer. A new 
group of subjects made choices between each pair 
of the mixed-expression chimeras generated from 
these computer-edited infant expressions, as in 
Experiment 1. 

Method 

Subjects 

Ninety-six familial right-handed university and 
high school students (51 female, 45 male) 
participated in Experiment 2. All had normal or
corrected vision, and all had participated in the
Best and Queen (1989) study. The university
students received $4.00 for their participation in
the 45-minute session; the high school students
were unpaid volunteers.

Stimuli 

High-quality photocopies of the original 
photographs from Experiment 1 were computer-
digitized and edited, using an Apple® Macintosh™
512-*- computer (see Best and Queen, 1989, for 
details). The cheeks, ears, chin, hair, and face 
outline were removed from the digitized pictures,
and the resulting edited images were printed in 
both normal and mirror-reversed orientation. 
These were then used to generate mixed-
expression chimeras of each infant (see examples,
Figure 1), which were assembled into a 40-page
test booklet and reproduced with a high-quality
photocopier, as in Experiment 1.
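
The generation of a pair of mixed-expression chimeras in normal and mirror-reversed orientation can be sketched in a few lines of Python. This is only an illustrative sketch, not the editing procedure actually used: it assumes that each chimera joins half of an emotional image to half of a neutral image of the same infant at a vertical midline, it represents images as nested lists of grayscale values, and the helper names (`mirror`, `chimera`, `chimera_pair`) are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale pixel values (assumed representation)


def mirror(img: Image) -> Image:
    """Return the left-right mirror-reversed copy of an image."""
    return [row[::-1] for row in img]


def chimera(left_src: Image, right_src: Image, midline: int) -> Image:
    """Take columns left of `midline` from one image and the rest from the other."""
    if len(left_src) != len(right_src):
        raise ValueError("images must have the same number of rows")
    out = []
    for l_row, r_row in zip(left_src, right_src):
        if len(l_row) != len(r_row):
            raise ValueError("rows must have the same width")
        out.append(l_row[:midline] + r_row[midline:])
    return out


def chimera_pair(emotional: Image, neutral: Image, midline: int) -> Tuple[Image, Image]:
    """Return (emotion-on-left, emotion-on-right) chimeras for one infant."""
    return chimera(emotional, neutral, midline), chimera(neutral, emotional, midline)


# Toy 1 x 4 "images": nonzero values stand in for the emotional expression.
emotional = [[1, 2, 3, 4]]
neutral = [[0, 0, 0, 0]]
normal_pair = chimera_pair(emotional, neutral, midline=2)

# Mirror-reversed orientation: flip both sources first; the midline column
# becomes width - midline (still 2 for this symmetric toy case).
width = len(emotional[0])
reversed_pair = chimera_pair(mirror(emotional), mirror(neutral), width - 2)
```

In the normal orientation pair, the left half of the picture carries the columns that originally sat on the left (the infant's right hemiface); after mirror reversal, the left half carries what was originally the right half of the picture.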







Figure 2. Examples of digitized mixed-expression chimeras of a smiling and a crying infant (panels: SMILE, CRY), with the emotional
expressions on the left versus the right side of the picture. Experiment 2.

Procedure

Subjects completed the test booklet under the 
same conditions and instructions as in 
Experiment 1. 

Results 

The data were transformed to laterality ratios
and analyzed as in Experiment 1. The significance
level for the multiple t-tests was set at p ≤ .007,
as before.
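
The text does not say how the adjusted significance level was derived. One common approach, shown here purely as an assumption, is a Bonferroni-style division of a familywise alpha by the number of tests. The familywise level of .05 and the count of seven tests below are hypothetical values, chosen only because they give roughly .007; the function names are likewise illustrative.

```python
from typing import List


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


def flag_significant(p_values: List[float], alpha: float) -> List[bool]:
    """Mark each p value as significant when it is at or below alpha."""
    return [p <= alpha for p in p_values]


# Hypothetical inputs: a .05 familywise alpha spread over 7 tests.
per_test_alpha = bonferroni_alpha(0.05, 7)

# Applying the threshold stated in the text (.007) to illustrative p values.
flags = flag_significant([0.003, 0.005, 0.02], alpha=0.007)
```

Under this threshold the first two illustrative p values would count as significant and the third would not.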

Again, there was a significant overall LVF bias
in perception of the emotional chimeras (see Table
2 for mean laterality ratios and t-tests). The
magnitude of this LVF bias did not change
significantly from that found in Experiment 1. The
emotion effect was significant, as in Experiment 1,
F(1,95) = 7.76, p < .007, again indicating a
stronger LVF bias for crying expressions than for
smiling expressions. The magnitude of the
difference in visual field biases between the crying
and the smiling infants did not differ significantly
from that found in Experiment 1, according to a
t-test. The orientation effect was also significant,
F(1,95) = 432.01, p = .0000, indicating that the
infants' expressive asymmetries affected the
viewers' judgments. There was a significant LVF
bias when the infants' more expressive right
hemiface was on the left side of the chimeras, but
a RVF bias when it was on the right side. The
Emotion x Orientation interaction was also
significant, F(1,95) = 63.64, p = .0000, following
the pattern found in Experiment 1. For the
smiling infants, the normal orientation chimeras
showed a strong LVF bias, and the mirror-
reversed chimeras showed a strong RVF bias (see
Table 2). The orientation of the crying infant
photos showed a similar but weaker influence,
yielding a moderate LVF bias for normal
orientation chimeras, and a much smaller RVF
bias for mirror-reversed chimeras. According to
simple effects tests of this interaction,
nevertheless, the orientation effect was significant
for both the crying expressions, F(1,95) = 13.23, p
< .0005, and the smiling expressions, F(1,95) =
65.76, p = .0000, and the emotion effect was
significant for both normal orientation chimeras,
F(1,95) = 559.26, p = .0000, and mirror-reversed
chimeras, F(1,95) = 104.62, p = .0000.



Table 2. Laterality ratios for the perception of mixed-expression chimeras of smiling and
crying infants in normal and mirror-reversed orientation. Experiment 2.

                                        Laterality
Effects                                 ratioᵃ        t valueᵇ     p
Overall perceptual bias                  -.11          -5.[?]3     .0000
Emotion effect
  Smile                                  -.07          -3.1[?]     .003
  Cry                                    -.15          -5.62       .0000
Orientation effect
  Normal orientation                     -.49          -19.13      .0000
  Mirror-reversed                        +.26           8.83       .0000
Emotion x Orientation interaction
  Smile
    Normal orientation                   -.56          -19.66      .0000
    Mirror-reversed                      +.42           12.34      .0000
  Cry
    Normal orientation                   -.41          -11.08      .0000
    Mirror-reversed                      +.11           2.94       .005

ᵃComputed as [R-L]/[R+L], where R = percent choices with emotional expression on right of
chimera, L = percent choices with emotional expression on left, and R + L = 100%. Negative
scores indicate a left visual field bias, positive scores a right field bias.

ᵇOne-sample t-tests of whether the mean laterality ratio was significantly greater than 0,
indicating a significant perceptual asymmetry.
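
The table note defines the laterality ratio as [R-L]/[R+L] and describes one-sample t-tests against 0. A minimal Python sketch of both computations follows. The function names and the sample values are hypothetical, and the t statistic uses the standard textbook formula (sample mean divided by its standard error), which the source does not spell out.

```python
import math
import statistics
from typing import Sequence


def laterality_ratio(pct_right: float, pct_left: float) -> float:
    """(R - L) / (R + L): negative means left-field bias, positive right-field bias."""
    total = pct_right + pct_left
    if total == 0:
        raise ValueError("R + L must be nonzero")
    return (pct_right - pct_left) / total


def one_sample_t(values: Sequence[float], mu: float = 0.0) -> float:
    """t statistic for testing whether the mean of `values` differs from `mu`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("t is undefined when all values are identical")
    return (statistics.mean(values) - mu) / (sd / math.sqrt(n))


# One hypothetical subject: 44.5% of choices with the emotion on the right,
# 55.5% with the emotion on the left, so the ratio is negative (about -0.11).
example_ratio = laterality_ratio(44.5, 55.5)

# Hypothetical ratios from four subjects, tested against 0.
example_t = one_sample_t([-0.2, -0.1, 0.0, -0.1])
```

The degrees of freedom for such a test are n - 1, where n is the number of subjects contributing a ratio.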
DISCUSSION

The results of Experiment 2 replicated those of 
Experiment 1, even though the gestalt of the 
whole faces had been modified by removal of the 
facial outline and of extraneous details other than 
the pattern of the central facial features. In fact, 
the magnitude of the effects failed to differ 
significantly from those found in Experiment 1. 
These findings suggest that the viewers' 
perception of the more complete chimeric 
photographs in the previous study had been based 
upon the central facial features rather than on 
their relation to peripheral information such as 
the "frame" of the facial outline and hair. They also
suggest that the negative-valence effect is due to
the emotional nature of the perceptual response to 
the faces, rather than to differential involvement 
of the right hemisphere's putative holistic 
approach and the left hemisphere's putative 
feature-analytic approach to negative vs. positive 
expressions, respectively. 

Perhaps, however, the stimulus manipulations 
of Experiment 2 did not provide sufficient 
interference with the gestalt of the facial 
expressions to disrupt the right hemisphere's 
greater holistic response to crying than to smiling 
expressions. A clearer disruption of the holistic 
approach would involve focusing the viewers' 
attention on even more narrowly-defined features 
of the faces, such as the mouth or the eye region, 
both of which would be expected to carry much of 
the information conveyed in an emotional 
expression. The third experiment investigated the 
possibility that such a narrow feature-oriented 
approach would influence the negative-valence 
effect.

EXPERIMENT 3

The purpose of the third experiment was to 
determine the effect of restricting the viewers' 
attention to the emotional expression displayed in 
single, isolated facial features, which should more 
definitely bias the perceptual approach toward the 
analytic, feature-oriented abilities ascribed to the 
left hemisphere. If information processing 
differences in the perception of smiles as 
compared to cries were responsible for the 
negative-valence effect in the earlier studies, then 
this manipulation should either eliminate the 
valence effect in perceptual asymmetries or shift it
to a strong RVF bias for smiling expressions with
a weak or nonexistent LVF bias for crying
expressions. However, if the negative-valence
effect arises instead from the emotional nature of
the judgments, then the pattern of findings and 
the magnitude of the effects should be impervious 
to this manipulation. 

In restricting viewers' attention to specific facial 
features, we focused on the expressive patterning 
of the mouth versus that of the eyes. In a
previous paper (Best & Queen, 1989), we had 
found that the infants' right hemiface bias in 
expressiveness was specific to the mouth region of 
the face, and was not present in the eye region, 
even though the viewers had no difficulty making 
emotionality judgments about pairs of chimeras 
generated from either of these isolated facial
regions. Because cortical input to the mouth 
region is contralateral, whereas input to the eye 
region is bilateral, those earlier results had 
suggested that lateralized cortical specializations, 
rather than more peripheral factors, are 
responsible for the right hemiface bias in infant 
expressions. Thus, a second purpose of the present 
experiment was to test whether adults' perceptual 
asymmetries are influenced by the difference in 
asymmetrical patterning between the eye and the
mouth regions of the infants' expressions. For
Experiment 3, a new group of judges was
presented with an "upper face" test and a "lower
face" test that employed further modifications of
the digitized, edited infant expressions developed 
in Experiment 2. 

Method 

Subjects 

Fifty-four familial right-handed university 
students (27 female, 27 male) participated in this 
experiment. They received $4.00 for participating
in a 45-minute session. All had normal or
corrected vision, and all had participated in the 
Best and Queen study (1989). 

Stimuli 

The digitized, edited faces from Experiment 2
were again revised to produce an "upper face" test,
for which all facial features other than the eyes,
brows and bridge of the nose were removed, and a
"lower face" test, for which all features other than
the mouth and the tip of the nose were eliminated.
Mixed-expression chimeras were generated
separately for the eyes/brows and for the mouth
(see Figure 2).³ Two 40-page test booklets were
constructed as in the previous experiments, one
for the "lower face" test and one for the "upper
face" test. These were duplicated on a high-quality
photocopier, as before.

Figure 2. Example of digitized mixed-expression chimeras of the eye region and the mouth region of a smiling and a
crying infant (panels: SMILE, CRY), with the emotional expressions on the left versus the right side of the picture. Experiment 3.



Procedure 

The subjects were tested under the same 
conditions and instructions as in the first two 
experiments. They each completed the "lower face"
test first and the "upper face" test second. Pilot
testing suggested that judgments about the 
eyes/brows might be more difficult than 
judgments about the mouth; this test order thus 
allowed more practice before the more difficult 
test. 

Results 

The data were handled as in Experiments 1 and 
2, except that the ANOVA for this experiment 
included a third, new factor: face part (mouth vs. 
eyes), and the alpha level adjustment for multiple
t-tests set the significance level at p ≤ .002.

Once again, there was an overall LVF bias (see
Table 3 for mean laterality ratios and t-tests). The
magnitude of this LVF bias did not differ
significantly from that found in either of the
previous experiments. The emotion effect was
again significant, F(1,53) = 13.96, p < .0005,
indicating that the valence of the infant emotion
continued to influence the viewers' perception of
the expressions, even when their attention was
restricted to isolated facial features. The crying
infants elicited a significant LVF bias, which
reversed to a nonsignificant RVF bias for the
smiling infants. The magnitude of this valence
effect again failed to differ significantly from that
found in Experiments 1 and 2. As in the first two
experiments, there was also a significant
orientation effect, F(1,53) = 39.70, p = .0000.
There was a LVF bias only for judgments of
normal orientation expressions, when the infants'
right hemiface was on the left side of the
chimeras. The mirror-reversed chimeras elicited a
small, nonsignificant RVF bias. The Emotion x
Orientation interaction was significant as well,
F(1,53) = 8.97, p < .004. Also as before, orientation
had a smaller effect on perceptual asymmetries in
response to crying than to smiling expressions.
There was a LVF bias for crying infants, which
was significant for the normal orientation
chimeras, but not for mirror-reversed chimeras. In
contrast, the normal orientation smiling infants
evoked an even larger LVF bias, but the mirror-
reversed smile expressions produced a significant
RVF bias. According to simple effects tests of this
interaction, the orientation effect was significant
for both the crying expressions, F(1,53) = 5.13, p =
.02, and the smiling expressions, F(1,53) = 76.97,
p < .0000. However, the emotion difference was
significant only for the mirror-reversed chimeras,
F(1,53) = 21.28, p < .0000, and not for the normal
orientation chimeras.



Table 3. Laterality ratios for the perception of mixed-expression chimeras of smiling
and crying infants in normal and mirror-reversed orientation, eye versus mouth.
Experiment 3.

                                                  Laterality    Summary statistics
Effects                                           ratioᵃ        t valueᵇ     p
Overall perceptual bias                            -.08          -4.44       .0000
Emotion effect
  Smile                                            -.02          -0.[?]      ns
  Cry                                              -.14          -5.14       .0000
Orientation effect
  Normal orientation                               -.19          -[?].47     .0000
  Mirror-reversed                                  +.03           1.09       ns
Emotion x Orientation interaction
  Smile
    Normal orientation                             -.17          -6.59       .0000
    Mirror-reversed                                +.14           4.79       .0000
  Cry
    Normal orientation                             -.20          -5.62       .0000
    Mirror-reversed                                -.08          -1.85       ns
Face part x Orientation interaction
  Mouth region
    Normal orientation                             -.35          -10.31      .0000
    Mirror-reversed                                +.20           4.93       .0000
  Eye region
    Normal orientation                             -.03          -1.10       ns
    Mirror-reversed                                -.14          -3.98       .0002
Face part x Orientation x Emotion interaction
  Mouth region
    Smile
      Normal orientation                           [missing]     -17.31      .0000
      Mirror-reversed                              +.52           11.92      .0000
    Cry
      Normal orientation                           -.13          -2.32       ns
      Mirror-reversed                              -.11          -1.53       ns
  Eye region
    Smile
      Normal orientation                           +.23           5.67       .0000
      Mirror-reversed                              -.24          -4.55       .0000
    Cry
      Normal orientation                           -.28          -6.23       .0000
      Mirror-reversed                              -.05          [missing]   ns

ᵃComputed as [R-L]/[R+L], where R = percent choices with emotional expression on right of
chimera, L = percent choices with emotional expression on left, and R + L = 100%.
Negative scores indicate a left visual field bias, positive scores a right field bias.

ᵇOne-sample t-tests of whether the mean laterality ratio was significantly greater than 0,
indicating a significant perceptual asymmetry.
The new face part factor also entered into two
significant interactions. The Face part x
Orientation interaction, F(1,53) = 101.86, p =
.0000, indicated that the infants' expressive
asymmetries had a greater influence on
perception of the mouth than of the eyes. The
normal orientation chimeras of the mouth yielded
a large LVF bias in perception, while the mirror-
reversed versions yielded a large RVF bias. In
contrast, the eye chimeras produced a smaller but
significant LVF bias when presented in mirror-
reversed orientation, which became nonsignificant
for the normal orientation eyes. Simple effects
tests of the Face part x Orientation interaction
showed that the face part difference was
significant for both the normal orientation
chimeras, F(1,53) = 62.39, p = .0000, and the
mirror-reversed ones, F(1,53) = 44.88, p < .0000.
Moreover, the orientation effect was significant
both for the mouth, F(1,53) = 119.58, p = .0000,
and for the eyes, F(1,53) = 6.68, p < .01.

The Face part x Orientation x Emotion
interaction was also significant, F(1,53) = 165.97,
p = .0000. The results for the mouth region
chimeras followed the pattern found in the earlier
experiments. The smiling mouths produced a
large LVF bias for normal orientation chimeras,
and a large RVF bias for mirror-reversed ones. Yet
orientation had no appreciable effect on perception
of the crying mouths, which showed a
nonsignificant LVF bias for both normal
orientation and mirror-reversed chimeras. The
crying eyes yielded a LVF bias that was
significant for normal orientation chimeras, but
not for mirror-reversed ones, also consistent with
the earlier findings. The smiling eyes, however,
elicited a RVF bias for normal orientation
chimeras, but a LVF bias for mirror-reversed
chimeras. The direction of this orientation effect
for judgments of smiling eyes was the opposite
from the pattern of the Emotion x Orientation
interactions found in Experiments 1 and 2, where
the normal orientation was associated with LVF
bias and the mirror-reversed orientation with
RVF bias. Thus, the smiling eyes appear to be
responsible for the Face part x Orientation
interaction. Nonetheless, it is still clear that, for
judgments of both the mouth and the eyes,
orientation had a greater effect on perceptual
responses toward the smiling expressions than
toward the crying expressions. According to
simple effects tests, the orientation effect was
significant for crying eyes, F(1,53) = 53.48, p =
.0006, for smiling mouths, F(1,53) = 457.26, p =
.0000, and for smiling eyes, F(1,53) = 11.06, p <
.002, but not for crying mouths. Also, the emotion
effect was significant for eyes in normal
orientation, F(1,53) = 56.96, p = .0000, and in
mirror-reversed orientation, F(1,53) = 9.72, p <
.003, as well as for mouths in normal orientation,
F(1,53) = 58.74, p = .0000, and in mirror-reversed
orientation, F(1,53) = 90.83, p = .0000.

DISCUSSION 

The significant emotion effect was not 
diminished relative to the two earlier 
experiments, in spite of restricting the viewers' 
attention to isolated facial features. This result 
strongly suggests that the negative-valence effect 
on asymmetries in perception of infant emotional
expressions derives from the emotional nature of 
the judgments rather than from differences in the 
information processing of negative versus positive 
expressions. Moreover, differences in perception of 
the eye and mouth regions suggest that the 
viewers were sensitive to differences in the 
expressive asymmetries displayed by those facial 
regions. Consistent with the Best & Queen (1989) 
finding that the right hemiface bias in infant 
expressions was significant only for the mouth 
region, the viewers in the present study were 
more affected by the orientation of the mouth than 
by the orientation of the eyes. For both the eyes 
and the mouth, however, the negative-valence 
effect held up (the Face part x Emotion interaction 
was not significant), in that there was a weaker 
orientation effect, or greater effect of perceptual 
asymmetry, for crying expressions than for 
smiling expressions. That is, the LVF bias in 
perception of infant emotional expressions was 
stronger (less affected by the infants' expressive 
asymmetries) for crying than for smiling 
expressions, suggesting greater right hemisphere
involvement in perception of negative than in 
positive emotions. 

GENERAL DISCUSSION 
The results of all three experiments support the 
hypothesis that the right hemisphere is 
specialized for perception of negative emotion, but 
that perception of positive emotion is less strongly 
lateralized, a view we have referred to as the 
negative-valence hypothesis (e.g., Ehrlichman,
1988). The valence hypothesis of RH specialization
for negative emotion, but LH specialization for
positive emotion, was not supported, in that
perception of infant smiles failed to show a
significant RVF-LH bias. Support was also lacking
for the motivational hypothesis, because the
approach response that adults would be expected
to show toward both crying and smiling infants
did not result in a RVF-LH perceptual bias, as
predicted.

Given that the majority of studies on
asymmetries in perception of adults' facial 
expressions have instead supported the RH 
hypothesis — that the right hemisphere is 
specialized for all emotion, regardless of valence — 
the present findings suggest that the influence of 
valence on perceptual asymmetries may depend 
on some involvement of emotional responses in 
the viewer. Our task, like those used in other 
reports that have supported variants of the 
valence hypothesis, called for judgments about the 
emotionality of the stimulus expressions. In
contrast, discrimination or recognition judgments 
were required by many of the studies that 
obtained an overall RH advantage for emotion 
perception. 

However, judgments about emotionality may not 
alone suffice to produce a negative-valence effect 
on perceptual asymmetries. Levy et al. (1983) 
presented subjects with pairs of mixed-expression
chimeras of half-neutral, half-smiling adult faces 
in a free-field task and asked for judgments about 
the relative emotionality of each pair, as in the 
present study, yet those researchers obtained a 
significant LVF-RH advantage for perception of 
positive emotion. Our use of infant facial 
expressions to increase the likelihood of the 
viewers' emotional response to the stimuli may 
have been important in obtaining an influence of 
valence on perceptual asymmetries. This 
suggestion is corroborated by our finding that the 
subjects in Experiments 1 and 2 showed a strong, 
significant LVF bias in their perception of the 
Levy et al. (1983) adult chimeric smiling
expressions (Best & Queen, 1987), whereas they 
had instead shown a nonsignificant (Experiment 
1) or small LVF bias (Experiment 2) in their 
perception of infant chimeric smiling expressions. 
Even stronger support is provided by Chaiken 
(1988), who used the same free-field chimeric face
technique to compare adults' perceptual
asymmetries for smiling and crying adult 
expressions versus smiling and crying infant 
expressions. Her subjects showed a significant 
valence effect in response to the infant 
expressions, but no valence effect in response to 
the adult expressions. It should be noted, 
however, that the viewers' actual emotional 
responses to the infant and adult stimuli were not
directly assessed in any of these studies. Thus,
further research is needed to test the hypothesis
that emotional responses in the viewer are crucial
in producing a valence effect on asymmetries in
the perception of emotional expressions.

The results of Experiments 2 and 3 also indicate
that the negative-valence effect on perception of 
infant expressions cannot be explained by 
differences in the balance between the basic 
information processing approaches of the two 
hemispheres during the perception of negative 
versus positive emotions. Although stimulus 
manipulations designed to focus the viewers' 
attention on progressively more restricted 
features of the infant facial expressions might 
have shifted perception toward the putative 
analytical, feature-oriented approach of the LH, 
these manipulations did not influence the overall 
degree of asymmetry in emotion perception. Nor,
more importantly, did the manipulations change 
the magnitude of the valence effect on perception. 
These findings thus suggest that the negative- 
valence effect in perception of infant emotional 
expressions reflects an aspect of hemispheric 
specialization that is independent of information 
processing asymmetries. The most likely basis for 
this separate characteristic of hemispheric 
specialization is an asymmetry in emotional 
responsiveness to stimuli, such that the RH shows 
greater sensitivity in the perception of negative 
emotion. 

It is interesting to note that the adult's LVF-RH 
bias in perception of infants' crying expressions is 
compatible with recent findings that emotional 
expressions are more intense on the right side of 
the infant's face (Best & Queen, 1989; Rothbart,
Taylor & Tucker, 1989). That is, in face-to-face 
interactions, the infant's more expressive 
hemiface would appear in the adult's more 
sensitive visual hemispatial field (see
Introduction; also Kinsbourne, 1978), presumably
increasing the likelihood of the adult's emotional 
response to the infant. This pattern of spatial 
compatibility between perceiver and producer of 
an emotional expression does not hold in the case 
of adults interacting face-to-face with other 
adults. Adults instead show a left hemiface bias in 
expressiveness, which would place the more 
expressive hemiface in the viewer's less sensitive 
hemispatial field. The enhancement of adults' 
sensitivity and responsiveness to infant 
expressions, relative to adult expressions, is 
consistent with ethological theory, as discussed in 
the general introduction. But why should there be 
greater compatibility between infant expressive 
asymmetry and adult perceptual asymmetry in 
the case of crying expressions than of smiling 
expressions? Perhaps this can be related to
differences in the imperativeness of adult
responses to infant distress and pleasure states. 
Presumably, infant distress may indicate some 
possible danger to the infant, requiring immediate 
action on the part of the caregiver or other adult, 
whereas an infant's smile does not signal the need 
for such immediate action. Therefore, the 
evolutionary pressure for enhanced responsiveness
to infant crying expressions would have
been greater than the pressure for enhanced
responsiveness to infant smiles.

FOOTNOTES

†Also Wesleyan University, Middletown, CT.
††Wesleyan University.

¹Performance level corrections such as the Phi coefficient (Kuhn,
1973) or λ (Bryden & Sprott, 1981) are neither necessary nor
applicable with binary forced-choice data.

²However, this failure may be due in part to the manner in
which those researchers paired their adult chimeras. Whereas
we always paired the two normal orientation chimeras or the
two mirror-reversed chimeras of a given infant for forced-
choice judgments, Levy et al. (1983) had always paired a
normal orientation chimera with its own mirror-image. Their
approach may have masked their viewers' sensitivity to poser
asymmetries. In an unpublished extension of their study we
modified the pairings of the Levy et al. adult chimeras such
that both members of each pair were in the same orientation
(Best & Queen, 1987). Under those conditions we found a
significant influence of poser asymmetries upon the viewers'
perceptual field biases. Even in our extension of Levy et al.,
nonetheless, the influence of adult poser asymmetries upon the
viewers' perceptual biases was very much smaller than that
found with our infant faces. It should also be noted that the
Levy et al. stimuli included only smiling expressions, which
yielded the largest influence of poser asymmetries for infant
expressions in the present study.

³The eye and mouth regions were not separated from one
another until after the midline had been drawn on each face
from the point midway between the internal canthi and the
center of the philtrum, as in Experiment 1.